RE: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
1 Hi Don, 2 It seems that we agree in the area of keeping the iommu active and using it to 3 isolate the legacy DMA. We also agree that the device's IO driver should be 4 the software that deals with the device (e.g.: resets it, etc.) 5 6 It also seems that there are several differences in our two approaches to how 7 the iommu is then used to accomplish this common goal. 8 9 More detailed replies are among your comments below. 10 11 On Monday 4/7/2014: Don Dutile wrote: 12 On 01/10/2014 05:07 PM, Bill Sumner wrote: 13 v2-v3: 14 1. Commented-out #define DEBUG 1 to eliminate debug messages 15 2. Updated the comments about changes in each version in all patches in the set. 16 3. Fixed: one-line added to Copy-Translations patch to initialize the iovad 17 struct as recommended by Baoquan He [b...@redhat.com] 18 init_iova_domain(domain-iovad, DMA_32BIT_PFN); 19 20 v1-v2: 21 The following series implements a fix for: 22 A kdump problem about DMA that has been discussed for a long time. That is, 23 when a kernel panics and boots into the kdump kernel, DMA started by the 24 panicked kernel is not stopped before the kdump kernel is booted and the 25 kdump kernel disables the IOMMU while this DMA continues. This causes the 26 IOMMU to stop translating the DMA addresses as IOVAs and begin to treat them 27 as physical memory addresses -- which causes the DMA to either: 28 (1) generate DMAR errors or (2) generate PCI SERR errors or (3) transfer 29 data to or from incorrect areas of memory. Often this causes the dump to fail. 30 31 This patch set modifies the behavior of the iommu in the (new) crashdump kernel: 32 1. to accept the iommu hardware in an active state, 33 2. to leave the current translations in-place so that legacy DMA will continue 34 using its current buffers until the device drivers in the crashdump kernel 35 initialize and initialize their devices, 36 3. to use different portions of the iova address ranges for the device drivers 37 in the crashdump kernel than the iova ranges that were in-use at the time 38 of the panic. 39 40 Advantages of this approach: 41 1. All manipulation of the IO-device is done by the Linux device-driver 42 for that device. 43 2. This approach behaves in a manner very similar to operation without an 44 active iommu. 45 46 Sorry to be late to the game finally getting out of a deep hole 47 asked to look at this proposal... 48 49 Along this concept -- similar to operation without an active iommu -- 50 have you considered the following: 51 a) if (this is crash kernel), turn *off* DMAR faults; 52 I did look at turning off DMAR faults and, although I chose not to disable 53 them in my current patch submissions, I think that this is still an open area 54 to explore -- especially for any cases where DMAR faults were already occurring 55 in the panicked kernel. We may want to explore expanding the patches to cover the 56 case where legacy DMAR faults are still occurring in the kdump kernel. 57 I would recommend tackling this as a separate follow-on effort. 58 In testing my patches, I have seen no new DMAR faults introduced by them. 59 In fact, the transition is seamless, leaving no time window where (for 60 instance) RMRR DMA would fail. 61 62 b) if (this is crash kernel), isolate all device DMA in IOMMU 63 I agree, the iommu is a great resource to isolate device DMA into safe memory 64 areas -- we should leave it active and use it. 65 66 b) as second kernel configures each device, have each device to use IOMMU hw-passthrough, 67 i.e., the equivalent of having no IOMMU for the second, kexec'd kernel 68but, having the benefit of keeping all the other (potentially bad) devices 69sequestered / isolated, until they are initialized re-configured in the second kernel, 70 *if at all* -- note: kexec'd kernels may not enable/configure all devices that 71 existed in the first kernel (Bill: I'm sure you know this, but others may not). 72 73 RMRR's that were previously setup could continue to work if they are skipped in step (b), 74 unless the device has gone mad/bad. In that case, re-parsing the RMRR may or may not 75 clear up the issue. 76 77 Additionally, a tidbit of information like some servers force NMI's on DMAR faults, 78 and cause a system reset, thereby, preventing a kdump to occur 79 should have been included as one reason to stop DMAR faults from occurring on kex
Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
On 04/08/2014 12:14 PM, David Woodhouse wrote: On Mon, 2014-04-07 at 16:43 -0400, Don Dutile wrote: Additionally, a tidbit of information like some servers force NMI's on DMAR faults, and cause a system reset, thereby, preventing a kdump to occur should have been included as one reason to stop DMAR faults from occurring on kexec-boot, in addition to the fact that a flood of them can lock up a system. How about allocating a physical scratch page, and setting up a mapping for each device such that *every* virtual address (apart from those listed in RMRRs, perhaps) is mapped to that same scratch page? That way you avoid the faults, but you also avoid stray DMA to parts of the system that you don't want to get corrupted. +1... more isolation as second kernel booting sounds good. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
On Mon, Mar 10, 2014 at 04:43PM, Vivek Goyal wrote: On Fri, Jan 10, 2014 at 03:07:26PM -0700, Bill Sumner wrote: [..] This patch set modifies the behavior of the iommu in the (new) crashdump kernel: 1. to accept the iommu hardware in an active state, 2. to leave the current translations in-place so that legacy DMA will continue using its current buffers until the device drivers in the crashdump kernelinitialize and initialize their devices, 3. to use different portions of the iova address ranges for the device drivers in the crashdump kernel than the iova ranges that were in-use at the time of the panic. Conceptually, above makes sense to me. I have few queries. - Do we need to pass any kind of data from first kernel to second kernel, like table size etc? Calgary IOMMU was using first kernel's tables also and it was determining previous kernel's table size using saved_max_pfn. Yes. We need to pass the intel-iommu translation tables from the first kernel to the second kernel - well, to allow the second kernel to read them. Only the tree of tables that the intel-iommu hardware references are needed: the root-entry table, the context-entry tables, and the page-translation tables. This patch involves only the intel-iommu -- and none of the other iommu hardware types. During the early initialization of the new kdump kernel this patch copies (duplicates) the tree of iommu translation tables from the old kernel into the new kernel. These tables are all linked together as a tree by physical memory addresses. The physical address of the top of the tree -- the root-entry table -- is obtained from a register in the iommu hardware. The copy operation traverses the tree in depth-first order, sanity-checking each table and then duplicating it into the kernel address space of the new kernel, linking the new tables appropriately. If the copy succeeds then the iommu register is updated to point to the copy in the new kernel and the iommu continues translating DMA requests from IO devices. If a sanity-check fails, the patch currently errors-out of the dump (might want to revisit this error case and see if there is something better to do.) By copying the tables into the new kernel, all of the existing code in intel-iommu.c continues to work nearly unchanged, with only a few added initializations of fields when kdump is active and when intel-iommu.c creates its bookkeeping structures for the device -- to make sure they correspond to the contents of the translate tables. I made a determined effort to avoid changing the existing execution path for the non-kdump case, and to minimize the changes even when kdump is active. Note also that this leaves the copy of the translate tables in the old panicked kernel unused and unchanged during the operation of 'makedumpfile', so they appear correctly in the dump as they were at the time of the panic. - I don't know how IOMMU translation tables look like, but are new DMA zones setup by drivers in second kernel part of same table? Yes. The copied translation tables in the kdump kernel contain both the translations that were active at the time of the dump (which will never go away until the dump is finished and the platform is rebooted), plus any translations needed by the kdump kernel (which come and go as necessary). As the bookkeeping structures are created in the new kernel (typically the first time that a device requests a translation for a buffer) they are initialized such that all IOVAs that already exist in the translate tables for tht device are marked as not available for allocation so all requests by device drivers in the new kernel receive IOVAs that were not in-use at the time of the panic. How do we make sure that sufficient space is available. So far I have not seen any problem with running out of space because of this copy; however, I believe that this may be a valid concern with very large IO configurations -- and it deserves some attention as the community reviews and tests this patch. I am not sure if possible table corruption from crashed kernel is an issue here or not. Any corrupted translation tables from the crashed kernel would be a problem that would prevent using copies of them. I hope and expect that this happens rarely -- and that we would catch it during the copy when it does. Fortunately, these tables are only manipulated by the code in intel-iommu.c - which limits the amount of code that has to be right. Of course, there is always the possibility that other code in the kernel could hose one of these tables -- so we need to sanity-check the tables before the new kernel uses them. In general, I think changelogs of these patches need to be little better. There seem to be lot of text and still I can't quickly wrap my head around what a particular patch is supposed to be doing. I will work on the changelogs as I re-arrange the code and the structure of the
FW: Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
Resending as plain-text. Bill From: Sumner, William Sent: Wednesday, March 12, 2014 11:15 AM To: 'Vivek Goyal' Cc: dw...@infradead.org; indou.ta...@jp.fujitsu.com; b...@redhat.com; j...@8bytes.org; linux-...@vger.kernel.org; ke...@lists.infradead.org; alex.william...@redhat.com; linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; ddut...@redhat.com; Hatch, Douglas B (HPS Linux PM); ishii.hiron...@jp.fujitsu.com; bhelg...@google.com; Li, Zhen-Hua Subject: Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU On Mon, Mar 10, 2014 at 04:43PM, Vivek Goyal wrote: On Fri, Jan 10, 2014 at 03:07:26PM -0700, Bill Sumner wrote: [..] This patch set modifies the behavior of the iommu in the (new) crashdump kernel: 1. to accept the iommu hardware in an active state, 2. to leave the current translations in-place so that legacy DMA will continue using its current buffers until the device drivers in the crashdump kernel initialize and initialize their devices, 3. to use different portions of the iova address ranges for the device drivers in the crashdump kernel than the iova ranges that were in-use at the time of the panic. Conceptually, above makes sense to me. I have few queries. - Do we need to pass any kind of data from first kernel to second kernel, like table size etc? Calgary IOMMU was using first kernel's tables also and it was determining previous kernel's table size using saved_max_pfn. Yes. We need to pass the intel-iommu translation tables from the first kernel to the second kernel - well, to allow the second kernel to read them. Only the tree of tables that the intel-iommu hardware references are needed: the root-entry table, the context-entry tables, and the page-translation tables. This patch involves only the intel-iommu -- and none of the other iommu hardware types. During the early initialization of the new kdump kernel this patch copies (duplicates) the tree of iommu translation tables from the old kernel into the new kernel. These tables are all linked together as a tree by physical memory addresses. The physical address of the top of the tree -- the root-entry table -- is obtained from a register in the iommu hardware. The copy operation traverses the tree in depth-first order, sanity-checking each table and then duplicating it into the kernel address space of the new kernel, linking the new tables appropriately. If the copy succeeds then the iommu register is updated to point to the copy in the new kernel and the iommu continues translating DMA requests from IO devices. If a sanity-check fails, the patch currently errors-out of the dump (might want to revisit this error case and see if there is something better to do.) By copying the tables into the new kernel, all of the existing code in intel-iommu.c continues to work nearly unchanged, with only a few added initializations of fields when kdump is active and when intel-iommu.c creates its bookkeeping structures for the device -- to make sure they correspond to the contents of the translate tables. I made a determined effort to avoid changing the existing execution path for the non-kdump case, and to minimize the changes even when kdump is active. Note also that this leaves the copy of the translate tables in the old panicked kernel unused and unchanged during the operation of 'makedumpfile', so they appear correctly in the dump as they were at the time of the panic. - I don't know how IOMMU translation tables look like, but are new DMA zones setup by drivers in second kernel part of same table? Yes. The copied translation tables in the kdump kernel contain both the translations that were active at the time of the dump (which will never go away until the dump is finished and the platform is rebooted), plus any translations needed by the kdump kernel (which come and go as necessary). As the bookkeeping structures are created in the new kernel (typically the first time that a device requests a translation for a buffer) they are initialized such that all IOVAs that already exist in the translate tables for tht device are marked as not available for allocation so all requests by device drivers in the new kernel receive IOVAs that were not in-use at the time of the panic. How do we make sure that sufficient space is available. So far I have not seen any problem with running out of space because of this copy; however, I believe that this may be a valid concern with very large IO configurations -- and it deserves some attention as the community reviews and tests this patch. I am not sure if possible table corruption from crashed kernel is an issue here or not. Any corrupted translation tables from the crashed kernel would be a problem that would prevent using copies of them. I hope and expect that this happens rarely -- and that we would catch it during the copy when it does. Fortunately, these tables are only manipulated by the code in intel
RE: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
Hi Joerg, On Tue 3/4/2014 9:06 AM, Joerg Roedel wrote: On Fri, Jan 10, 2014 at 03:07:26PM -0700, Bill Sumner wrote: Bill Sumner (6): Crashdump-Accepting-Active-IOMMU-Flags-and-Prototype Crashdump-Accepting-Active-IOMMU-Utility-functions Crashdump-Accepting-Active-IOMMU-Domain-Interfaces Crashdump-Accepting-Active-IOMMU-Copy-Translations Crashdump-Accepting-Active-IOMMU-Debug-Print-IOMMU Crashdump-Accepting-Active-IOMMU-Call-From-Mainline drivers/iommu/intel-iommu.c | 1293 --- 1 file changed, 1225 insertions(+), 68 deletions(-) This is a pretty big change, before merging I would like to get more eyes on it. Please make sure you get more reviews on the code and the general concept. A few things I noticed while looking at it: Yes, as I was developing this, I realized that it is far more code than most Linux submissions. In addition to more eyes reviewing the code, I hope to see additional members of the community trying it out -- testing it on a wide variety of system configurations. Let me encourage additional reviewers and testers to look at this fix to crashdump. * You put a lot of code into #define CONFIG_CRASH_DUMP. Isn't the same approach necessary to for KEXEC? Until I saw your email, I had been focused only on the immediate problem of fixing crashdump and had not considered using this approach in the more-general context of KEXEC. It certainly looks like it could be useful with KEXEC as well. I would like to take-on the effort of expanding this technique into KEXEC as a separate follow-on project so that the original fix crashdump project is not delayed. On a quick look, I believe that the amount of additional code would be minimal; but the expansion of the testing matrix might be quite large. * Please get rid of all the pr_debug stuff. Will do. * I think it makes sense to put the functions to access the IOMMU configuration values of the old kernel into a new file. This is better than adding over 1000 new lines to intel-iommu.c Sounds good. I will move these functions into a new file. * I have seen your new single-patch for this change. A single patch with more than 1200 changed lines is not something anyone is willing to review. Please split the patches again in a meaningful and bisectable way. I will return to a multiple-patch set for future submissions. -- Bill ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
On Fri, Jan 10, 2014 at 03:07:26PM -0700, Bill Sumner wrote: [..] This patch set modifies the behavior of the iommu in the (new) crashdump kernel: 1. to accept the iommu hardware in an active state, 2. to leave the current translations in-place so that legacy DMA will continue using its current buffers until the device drivers in the crashdump kernel initialize and initialize their devices, 3. to use different portions of the iova address ranges for the device drivers in the crashdump kernel than the iova ranges that were in-use at the time of the panic. Conceptually, above makes sense to me. I have few queries. - Do we need to pass any kind of data from first kernel to second kernel, like table size etc? Calgary IOMMU was using first kernel's tables also and it was determining previous kernel's table size using saved_max_pfn. - I don't know how IOMMU translation tables look like, but are new DMA zones setup by drivers in second kernel part of same table? How do we make sure that sufficient space is available. I am not sure if possible table corruption from crashed kernel is an issue here or not. In general, I think changelogs of these patches need to be little better. There seem to be lot of text and still I can't quickly wrap my head around what a particular patch is supposed to be doing. But we definitely need to fix this issue. IOMMU issues with kdump have been troubling us for a very long time. Thanks Vivek ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
Hi Bill, On Fri, Jan 10, 2014 at 03:07:26PM -0700, Bill Sumner wrote: Bill Sumner (6): Crashdump-Accepting-Active-IOMMU-Flags-and-Prototype Crashdump-Accepting-Active-IOMMU-Utility-functions Crashdump-Accepting-Active-IOMMU-Domain-Interfaces Crashdump-Accepting-Active-IOMMU-Copy-Translations Crashdump-Accepting-Active-IOMMU-Debug-Print-IOMMU Crashdump-Accepting-Active-IOMMU-Call-From-Mainline drivers/iommu/intel-iommu.c | 1293 --- 1 file changed, 1225 insertions(+), 68 deletions(-) This is a pretty big change, before merging I would like to get more eyes on it. Please make sure you get more reviews on the code and the general concept. A few things I noticed while looking at it: * You put a lot of code into #define CONFIG_CRASH_DUMP. Isn't the same approach necessary to for KEXEC? * Please get rid of all the pr_debug stuff. * I think it makes sense to put the functions to access the IOMMU configuration values of the old kernel into a new file. This is better than adding over 1000 new lines to intel-iommu.c * I have seen your new single-patch for this change. A single patch with more than 1200 changed lines is not something anyone is willing to review. Please split the patches again in a meaningful and bisectable way. Thanks, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
Tested this patchset on my local HP Z420 workstation, and it works very well. Hi Bill, Thanks for your effort. There are several concerns from me. Firstly, I think the patch log need be rearanged. Patchset cover letter can contain information to express why, how briefly. If you think this is very useful, it can be split and put into patch log. Then for each patch, patch log should be accurate and summary to describe why and how this patch really does. If you feel several patches have the corelation, they may need to be merged. Secondly, each patch could get a seperate subject which tells what this patch really does. Even they are merged to kernel git tree, each of them is a independent commit. People can take to use or depend only one of them. Actually, I don't like current patch subject. Thirdly, this patchset will be part of intel-iommu, though they only works for kdump kernel. As a subsystem, the style need be consistent. I like the debug method which introduces a struct pr_debug, however maintainers may not like it. Because a debug utility may bloat code and affect people's review. Personally I like refined code, the less the easier to review. Or put it as a independent patch at the end of the patchset, let maintainer decide whether it's OK to pull in. Sorry to say so much, I think this solution is truly the right way. As you know, it's a big problem for kdump when intel-iommu is active in 1st kernel. Because of this bug, many machines with intel-iommu have to be set intel-iommu=off, the performance is affected very much. Baoquan Thanks On 01/10/14 at 03:07pm, Bill Sumner wrote: v2-v3: 1. Commented-out #define DEBUG 1 to eliminate debug messages 2. Updated the comments about changes in each version in all patches in the set. 3. Fixed: one-line added to Copy-Translations patch to initialize the iovad struct as recommended by Baoquan He [b...@redhat.com] init_iova_domain(domain-iovad, DMA_32BIT_PFN); v1-v2: The following series implements a fix for: A kdump problem about DMA that has been discussed for a long time. That is, when a kernel panics and boots into the kdump kernel, DMA started by the panicked kernel is not stopped before the kdump kernel is booted and the kdump kernel disables the IOMMU while this DMA continues. This causes the IOMMU to stop translating the DMA addresses as IOVAs and begin to treat them as physical memory addresses -- which causes the DMA to either: (1) generate DMAR errors or (2) generate PCI SERR errors or (3) transfer data to or from incorrect areas of memory. Often this causes the dump to fail. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu