Re: [PATCH] mm/mapping_dirty_helpers: Guard hugepage pud's usage
On 4/8/21 9:15 PM, Zack Rusin wrote: Lets make sure we don't use pud hugepage helpers on architectures which do not support it. This fixes the code on arm64. nits: Perhaps be a little more specific about what it fixes? I figure it's a compilation failure? Also please use imperative form: "Fix the code arm64" rather than "This fixes the code on arm64" Other than that LGTM. Reviewed-by: Thomas Hellström (Intel) Signed-off-by: Zack Rusin Cc: Andrew Morton Cc: Thomas Hellström (Intel) Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/mapping_dirty_helpers.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c index b59054ef2e10..b890854ec761 100644 --- a/mm/mapping_dirty_helpers.c +++ b/mm/mapping_dirty_helpers.c @@ -165,10 +165,12 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end, return 0; } +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD /* Huge pud */ walk->action = ACTION_CONTINUE; if (pud_trans_huge(pudval) || pud_devmap(pudval)) WARN_ON(pud_write(pudval) || pud_dirty(pudval)); +#endif return 0; }
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/24/21 10:58 AM, Daniel Vetter wrote: On Tue, Mar 23, 2021 at 09:42:18PM +0100, Thomas Hellström (Intel) wrote: On 3/23/21 8:52 PM, Williams, Dan J wrote: On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote: TTM sets up huge page-table-entries both to system- and device memory, and we don't want gup to assume there are always valid backing struct pages for these. For PTEs this is handled by setting the pte_special bit, but for the huge PUDs and PMDs, we have neither pmd_special nor pud_special. Normally, huge TTM entries are identified by looking at vma_is_special_huge(), but fast gup can't do that, so as an alternative define _devmap entries for which there are no backing dev_pagemap as special, update documentation and make huge TTM entries _devmap, after verifying that there is no backing dev_pagemap. Please do not abuse p{m,u}d_devmap like this. I'm in the process of removing get_devpagemap() from the gup-fast path [1]. Instead there should be space for p{m,u}d_special in the page table entries (at least for x86-64). So the fix is to remove that old assumption that huge pages can never be special. [1]: http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com Hmm, yes with that patch it will obviously not work as intended. Given that, I think we'll need to disable the TTM huge pages for now until we can sort out and agree on using a page table entry bit. Yeah :-/ I think going full pud/pmd_mkspecial should then also mesh well with Jason's request to wrap it all up into a vmf_insert_* helper, so at least it would all look rather pretty in the end. Yes, I agree. Seems like the special (SW1) is available also for huge page table entries on x86 AFAICT, although just not implemented. Otherwise the SW bits appear completely used up. The PTE size vmf_insert_pfn__xxx functions either insert one of devmap or special. I think the only users of the huge insert functions apart form TTM currently insert devmap so we should probably be able to do the same, and then DRM / TTM wouldn't need to care at all about special or not. /Thomas -Daniel
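For reference, the PTE-level choice referred to above ("either insert one of devmap or special") looks roughly like the following, paraphrased from insert_pfn() in mm/memory.c of kernels from this era; the wrapper function name is made up for illustration and only the structure matters:

/* Paraphrase of the devmap-vs-special decision in insert_pfn(), mm/memory.c.
 * Check the exact source in your tree. */
static pte_t example_make_pfn_pte(pfn_t pfn, pgprot_t prot)
{
	if (pfn_t_devmap(pfn))
		return pte_mkdevmap(pfn_t_pte(pfn, prot));
	return pte_mkspecial(pfn_t_pte(pfn, prot));
}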
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/24/21 1:24 PM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote: On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote: On 3/23/21 5:37 PM, Jason Gunthorpe wrote: On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote: @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, if ((pfn & (fault_page_size - 1)) != 0) goto out_fallback; + /* +* Huge entries must be special, that is marking them as devmap +* with no backing device map range. If there is a backing +* range, Don't insert a huge entry. +* If this check turns out to be too much of a performance hit, +* we can instead have drivers indicate whether they may have +* backing device map ranges and if not, skip this lookup. +*/ I think we can do this statically: - if it's system memory we know there's no devmap for it, and we do the trick to block gup_fast Yes, that should work. - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV, so might as well not do that I think gup_fast will unfortunately mistake a huge iomem page for an ordinary page and try to access a non-existant struct page for it, unless we do the devmap trick. And the lookup would then be for the rare case where a driver would have already registered a dev_pagemap for an iomem area which may also be mapped through TTM (like the patch from Felix a couple of weeks ago). If a driver can promise not to do that, then we can safely remove the lookup. Isn't the devmap PTE flag arch optional? Does this fall back to not using huge pages on arches that don't support it? Good point. No, currently it's only conditioned on transhuge page support. Need to condition it on also devmap support. Also, I feel like this code to install "pte_special" huge pages does not belong in the drm subsystem.. I could add helpers in huge_memory.c: vmf_insert_pfn_pmd_prot_special() and vmf_insert_pfn_pud_prot_special() The somewhat annoying thing is that we'd need an error code so we fall back to pte fault handling. That's at least my understanding of how pud/pmd fault handling works. Not sure how awkward that is going to be with the overall fault handling flow. But aside from that I think this makes tons of sense. Why should the driver be so specific? vmf_insert_pfn_range_XXX() And it will figure out the optimal way to build the page tables. Driver should provide the largest physically contiguous range it can I figure that would probably work, but since the huge_fault() interface is already providing the size of the fault based on how the pagetable is currently populated I figure that would have to move a lot of that logic into that helper... /Thomas Jason
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/24/21 1:41 PM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 01:35:17PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 1:24 PM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote: On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote: On 3/23/21 5:37 PM, Jason Gunthorpe wrote: On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote: @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, if ((pfn & (fault_page_size - 1)) != 0) goto out_fallback; + /* +* Huge entries must be special, that is marking them as devmap +* with no backing device map range. If there is a backing +* range, Don't insert a huge entry. +* If this check turns out to be too much of a performance hit, +* we can instead have drivers indicate whether they may have +* backing device map ranges and if not, skip this lookup. +*/ I think we can do this statically: - if it's system memory we know there's no devmap for it, and we do the trick to block gup_fast Yes, that should work. - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV, so might as well not do that I think gup_fast will unfortunately mistake a huge iomem page for an ordinary page and try to access a non-existant struct page for it, unless we do the devmap trick. And the lookup would then be for the rare case where a driver would have already registered a dev_pagemap for an iomem area which may also be mapped through TTM (like the patch from Felix a couple of weeks ago). If a driver can promise not to do that, then we can safely remove the lookup. Isn't the devmap PTE flag arch optional? Does this fall back to not using huge pages on arches that don't support it? Good point. No, currently it's only conditioned on transhuge page support. Need to condition it on also devmap support. Also, I feel like this code to install "pte_special" huge pages does not belong in the drm subsystem.. I could add helpers in huge_memory.c: vmf_insert_pfn_pmd_prot_special() and vmf_insert_pfn_pud_prot_special() The somewhat annoying thing is that we'd need an error code so we fall back to pte fault handling. That's at least my understanding of how pud/pmd fault handling works. Not sure how awkward that is going to be with the overall fault handling flow. But aside from that I think this makes tons of sense. Why should the driver be so specific? vmf_insert_pfn_range_XXX() And it will figure out the optimal way to build the page tables. Driver should provide the largest physically contiguous range it can I figure that would probably work, but since the huge_fault() interface is already providing the size of the fault based on how the pagetable is currently populated I figure that would have to move a lot of that logic into that helper... But we don't really care about the size of the fault when we stuff the pfns. The device might use it when handling the fault, but once the fault is handled the device knows what the contiguous pfn range is that it has available to stuff into the page tables, it just tells the vmf_insert what it was able to create, and it creates the necessary page table structure. The size of the hole in the page table is really only advisory, the device may not want to make a 2M or 1G page entry and may prefer to only create 4k. In an ideal world the creation/destruction of page table levels would by dynamic at this point, like THP. Hmm, but I'm not sure what problem we're trying to solve by changing the interface in this way? 
Currently, if the core vm requests a huge pud, we give it one, and if we can't or don't want to (because of dirty-tracking, for example, which is always done at 4K page level) we just return VM_FAULT_FALLBACK, and the fault is retried at a lower level. Also, determining whether we have a contiguous range is not free, so we don't want to do that unnecessarily. /Thomas
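A minimal sketch of the fallback flow described above, assuming a driver-side huge_fault() handler; the my_drv_* names are illustrative and the real TTM code lives in ttm_bo_vm.c:

static vm_fault_t my_drv_huge_fault(struct vm_fault *vmf,
				    enum page_entry_size pe_size)
{
	/* Dirty tracking is done at 4K granularity, so refuse to set up a
	 * huge entry and let the core vm retry the fault one level down. */
	if (my_drv_needs_dirty_tracking(vmf->vma))
		return VM_FAULT_FALLBACK;

	switch (pe_size) {
	case PE_SIZE_PMD:
		return my_drv_insert_huge(vmf, HPAGE_PMD_SIZE >> PAGE_SHIFT);
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
	case PE_SIZE_PUD:
		return my_drv_insert_huge(vmf, HPAGE_PUD_SIZE >> PAGE_SHIFT);
#endif
	default:
		/* PE_SIZE_PTE and anything unknown: normal pte fault path. */
		return VM_FAULT_FALLBACK;
	}
}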
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/24/21 2:48 PM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote: In an ideal world the creation/destruction of page table levels would by dynamic at this point, like THP. Hmm, but I'm not sure what problem we're trying to solve by changing the interface in this way? We are trying to make a sensible driver API to deal with huge pages. Currently if the core vm requests a huge pud, we give it one, and if we can't or don't want to (because of dirty-tracking, for example, which is always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the fault is retried at a lower level. Well, my thought would be to move the pte related stuff into vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK. I don't know if the locking works out, but it feels cleaner that the driver tells the vmf how big a page it can stuff in, not the vm telling the driver to stuff in a certain size page which it might not want to do. Some devices want to work on a in-between page size like 64k so they can't form 2M pages but they can stuff 64k of 4K pages in a batch on every fault. Hmm, yes, but we would in that case be limited anyway to insert ranges smaller than and equal to the fault size to avoid extensive and possibly unnecessary checks for contigous memory. And then if we can't support the full fault size, we'd need to either presume a size and alignment of the next level or search for contigous memory in both directions around the fault address, perhaps unnecessarily as well. I do think the current interface works ok, as we're just acting on what the core vm tells us to do. /Thomas That idea doesn't fit naturally if the VM is driving the size. Jason
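For clarity, the interface being proposed in this subthread would look something like the declaration below. This is purely hypothetical -- no such helper exists in the kernel; the idea is that the driver reports the largest physically contiguous range it has and the core vm decides whether to build PUD, PMD or PTE entries from it:

/* Hypothetical, sketched from the discussion above -- not a real API. */
vm_fault_t vmf_insert_pfn_range(struct vm_fault *vmf, unsigned long addr,
				pfn_t first_pfn, unsigned long nr_pages,
				pgprot_t prot);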
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/24/21 7:31 PM, Christian König wrote: Am 24.03.21 um 17:38 schrieb Jason Gunthorpe: On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 2:48 PM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote: In an ideal world the creation/destruction of page table levels would by dynamic at this point, like THP. Hmm, but I'm not sure what problem we're trying to solve by changing the interface in this way? We are trying to make a sensible driver API to deal with huge pages. Currently if the core vm requests a huge pud, we give it one, and if we can't or don't want to (because of dirty-tracking, for example, which is always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the fault is retried at a lower level. Well, my thought would be to move the pte related stuff into vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK. I don't know if the locking works out, but it feels cleaner that the driver tells the vmf how big a page it can stuff in, not the vm telling the driver to stuff in a certain size page which it might not want to do. Some devices want to work on a in-between page size like 64k so they can't form 2M pages but they can stuff 64k of 4K pages in a batch on every fault. Hmm, yes, but we would in that case be limited anyway to insert ranges smaller than and equal to the fault size to avoid extensive and possibly unnecessary checks for contigous memory. Why? The insert function is walking the page tables, it just updates things as they are. It learns the arragement for free while doing the walk. The device has to always provide consistent data, if it overlaps into pages that are already populated that is fine so long as it isn't changing their addresses. And then if we can't support the full fault size, we'd need to either presume a size and alignment of the next level or search for contigous memory in both directions around the fault address, perhaps unnecessarily as well. You don't really need to care about levels, the device should be faulting in the largest memory regions it can within its efficiency. If it works on 4M pages then it should be faulting 4M pages. The page size of the underlying CPU doesn't really matter much other than some tuning to impact how the device's allocator works. Yes, but then we'd be adding a lot of complexity into this function that is already provided by the current interface for DAX, for little or no gain, at least in the drm/ttm setting. Please think of the following situation: You get a fault, you do an extensive time-consuming scan of your VRAM buffer object into which the fault goes and determine you can fault 1GB. Now you hand it to vmf_insert_range() and because the user-space address is misaligned, or already partly populated because of a previous eviction, you can only fault single pages, and you end up faulting a full GB of single pages perhaps for a one-time small update. On top of this, unless we want to do the walk trying increasingly smaller sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() and teach it about transhuge page table entries, because pagewalk.c can't be used (It can't populate page tables). That also means apply_to_page_range() needs to be complicated with page table locks since transhuge pages aren't stable and can be zapped and refaulted under us while we do the walk. 
On top of this, the user-space address allocator needs to know how large gpu pages are aligned in buffer objects to have a reasonable chance of aligning with CPU huge page boundaries, which is a requirement to be able to insert a huge CPU page table entry, so the driver would basically need the drm helper that can do this alignment anyway. All this makes me think we should settle for the current interface for now, and if someone feels like refining it, I'm fine with that. After all, this isn't a strange drm/ttm invention, it's a pre-existing interface that we reuse. I agree with Jason here. We get the best efficiency when we look at what the GPU driver provides and make sure that we handle one GPU page at once instead of looking too much into what the CPU is doing with its page tables. At least on AMD GPUs the GPU page size can be anything between 4KiB and 2GiB, and if we fill in a 2GiB chunk at once this can in theory be handled by just two giant page table entries on the CPU side. Yes, but I fail to see why, with the current code, we can't do this (save the refcounting bug)? /Thomas
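The alignment constraint mentioned above boils down to checks like the sketch below, matching the ones already done in ttm_bo_vm_insert_huge(); the function name and parameters here are illustrative only:

/* A huge entry of fault_page_size (in units of PAGE_SIZE) is only possible
 * when both the user VA and the backing pfn sit on that boundary. */
static bool can_use_huge_entry(unsigned long vaddr, unsigned long pfn,
			       unsigned long fault_page_size)
{
	if (vaddr & ((fault_page_size << PAGE_SHIFT) - 1))
		return false;		/* user VA misaligned */
	if (pfn & (fault_page_size - 1))
		return false;		/* physical address misaligned */
	return true;
}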
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/24/21 5:34 PM, Dave Hansen wrote: On 3/24/21 3:05 AM, Thomas Hellström (Intel) wrote: Yes, I agree. Seems like the special (SW1) is available also for huge page table entries on x86 AFAICT, although just not implemented. Otherwise the SW bits appear completely used up. Although the _PAGE_BIT_SOFTW* bits are used up, there's plenty of room in the hardware PTEs. Bits 52->58 are software-available, and we're only using 58 at the moment. We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are used. It's quite possible we can encode another use even in the existing bits. Personally, I'd just try: #define _PAGE_BIT_SOFTW5 57 /* available for programmer */ OK, I'll follow your advice here. FWIW I grepped for SW1 and it seems used in a selftest, but only for PTEs AFAICT. Oh, and we don't care about 32-bit much anymore? /Thomas
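For illustration, Dave's suggestion would end up looking roughly like this in arch/x86/include/asm/pgtable_types.h; the _PAGE_BIT_HUGE_SPECIAL name is made up for this sketch and nothing like it was merged as part of this thread:

#define _PAGE_BIT_SOFTW5	57	/* available for programmer */
/* Hypothetical user of the new software bit, name chosen only for this sketch: */
#define _PAGE_BIT_HUGE_SPECIAL	_PAGE_BIT_SOFTW5
#define _PAGE_HUGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_HUGE_SPECIAL)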
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/25/21 12:14 AM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 7:31 PM, Christian König wrote: Am 24.03.21 um 17:38 schrieb Jason Gunthorpe: On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 2:48 PM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote: In an ideal world the creation/destruction of page table levels would by dynamic at this point, like THP. Hmm, but I'm not sure what problem we're trying to solve by changing the interface in this way? We are trying to make a sensible driver API to deal with huge pages. Currently if the core vm requests a huge pud, we give it one, and if we can't or don't want to (because of dirty-tracking, for example, which is always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the fault is retried at a lower level. Well, my thought would be to move the pte related stuff into vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK. I don't know if the locking works out, but it feels cleaner that the driver tells the vmf how big a page it can stuff in, not the vm telling the driver to stuff in a certain size page which it might not want to do. Some devices want to work on a in-between page size like 64k so they can't form 2M pages but they can stuff 64k of 4K pages in a batch on every fault. Hmm, yes, but we would in that case be limited anyway to insert ranges smaller than and equal to the fault size to avoid extensive and possibly unnecessary checks for contigous memory. Why? The insert function is walking the page tables, it just updates things as they are. It learns the arragement for free while doing the walk. The device has to always provide consistent data, if it overlaps into pages that are already populated that is fine so long as it isn't changing their addresses. And then if we can't support the full fault size, we'd need to either presume a size and alignment of the next level or search for contigous memory in both directions around the fault address, perhaps unnecessarily as well. You don't really need to care about levels, the device should be faulting in the largest memory regions it can within its efficiency. If it works on 4M pages then it should be faulting 4M pages. The page size of the underlying CPU doesn't really matter much other than some tuning to impact how the device's allocator works. Yes, but then we'd be adding a lot of complexity into this function that is already provided by the current interface for DAX, for little or no gain, at least in the drm/ttm setting. Please think of the following situation: You get a fault, you do an extensive time-consuming scan of your VRAM buffer object into which the fault goes and determine you can fault 1GB. Now you hand it to vmf_insert_range() and because the user-space address is misaligned, or already partly populated because of a previous eviction, you can only fault single pages, and you end up faulting a full GB of single pages perhaps for a one-time small update. Why would "you can only fault single pages" ever be true? If you have 1GB of pages then the vmf_insert_range should allocate enough page table entries to consume it, regardless of alignment. Ah yes, What I meant was you can only insert PTE size entries, either because of misalignment or because the page-table is alredy pre-populated with pmd size page directories, which you can't remove with only the read side of the mmap lock held. 
And why shouldn't DAX switch to this kind of interface anyhow? It is basically exactly the same problem. The underlying filesystem block size is *not* necessarily aligned to the CPU page table sizes and DAX would benefit from better handling of this mismatch. First, I think we must sort out what "better handling" means. This is my takeout of the discussion so far: Claimed Pros of vmf_insert_range(): * We get an interface that doesn't require knowledge of CPU page table entry level sizes. * We get the best efficiency when we look at what the GPU driver provides. (I disagree on this one). Claimed Cons: * A new implementation that may get complicated, particularly if it involves modifying all of the DAX code * The driver would have to know about those sizes anyway to get alignment right (Applies to DRM, because we mmap buffer objects, not physical address ranges. But not to DAX AFAICT), * We lose efficiency, because we are prepared to spend an extra effort for alignment- and continuity checks when we know we can insert a huge page table entry, but not if we know we can't * We lose efficiency because we might unnecessarily prefault a number of PTE size page-table entries (really a special case of the above one). Now in the context of quickly fixing a critical bug, the choice IMHO becomes easy.
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/25/21 9:27 AM, Christian König wrote: Am 25.03.21 um 08:48 schrieb Thomas Hellström (Intel): On 3/25/21 12:14 AM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 7:31 PM, Christian König wrote: Am 24.03.21 um 17:38 schrieb Jason Gunthorpe: On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 2:48 PM, Jason Gunthorpe wrote: On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote: In an ideal world the creation/destruction of page table levels would by dynamic at this point, like THP. Hmm, but I'm not sure what problem we're trying to solve by changing the interface in this way? We are trying to make a sensible driver API to deal with huge pages. Currently if the core vm requests a huge pud, we give it one, and if we can't or don't want to (because of dirty-tracking, for example, which is always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the fault is retried at a lower level. Well, my thought would be to move the pte related stuff into vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK. I don't know if the locking works out, but it feels cleaner that the driver tells the vmf how big a page it can stuff in, not the vm telling the driver to stuff in a certain size page which it might not want to do. Some devices want to work on a in-between page size like 64k so they can't form 2M pages but they can stuff 64k of 4K pages in a batch on every fault. Hmm, yes, but we would in that case be limited anyway to insert ranges smaller than and equal to the fault size to avoid extensive and possibly unnecessary checks for contigous memory. Why? The insert function is walking the page tables, it just updates things as they are. It learns the arragement for free while doing the walk. The device has to always provide consistent data, if it overlaps into pages that are already populated that is fine so long as it isn't changing their addresses. And then if we can't support the full fault size, we'd need to either presume a size and alignment of the next level or search for contigous memory in both directions around the fault address, perhaps unnecessarily as well. You don't really need to care about levels, the device should be faulting in the largest memory regions it can within its efficiency. If it works on 4M pages then it should be faulting 4M pages. The page size of the underlying CPU doesn't really matter much other than some tuning to impact how the device's allocator works. Yes, but then we'd be adding a lot of complexity into this function that is already provided by the current interface for DAX, for little or no gain, at least in the drm/ttm setting. Please think of the following situation: You get a fault, you do an extensive time-consuming scan of your VRAM buffer object into which the fault goes and determine you can fault 1GB. Now you hand it to vmf_insert_range() and because the user-space address is misaligned, or already partly populated because of a previous eviction, you can only fault single pages, and you end up faulting a full GB of single pages perhaps for a one-time small update. Why would "you can only fault single pages" ever be true? If you have 1GB of pages then the vmf_insert_range should allocate enough page table entries to consume it, regardless of alignment. 
Ah yes, What I meant was you can only insert PTE size entries, either because of misalignment or because the page-table is alredy pre-populated with pmd size page directories, which you can't remove with only the read side of the mmap lock held. Please explain that further. Why do we need the mmap lock to insert PMDs but not when insert PTEs? We don't. But once you've inserted a PMD directory you can't remove it unless you have the mmap lock (and probably also the i_mmap_lock in write mode). That for example means that if you have a VRAM region mapped with huge PMDs, and then it gets evicted, and you happen to read a byte from it when it's evicted and therefore populate the full region with PTEs pointing to system pages, you can't go back to huge PMDs again without a munmap() in between. And why shouldn't DAX switch to this kind of interface anyhow? It is basically exactly the same problem. The underlying filesystem block size is *not* necessarily aligned to the CPU page table sizes and DAX would benefit from better handling of this mismatch. First, I think we must sort out what "better handling" means. This is my takeout of the discussion so far: Claimed Pros: of vmf_insert_range() * We get an interface that doesn't require knowledge of CPU page table entry level sizes. * We get the best efficiency when we look at what the GPU driver provides. (I disagree on this one). Claimed Cons:
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/25/21 12:30 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 10:51:35AM +0100, Thomas Hellström (Intel) wrote: Please explain that further. Why do we need the mmap lock to insert PMDs but not when insert PTEs? We don't. But once you've inserted a PMD directory you can't remove it unless you have the mmap lock (and probably also the i_mmap_lock in write mode). That for example means that if you have a VRAM region mapped with huge PMDs, and then it gets evicted, and you happen to read a byte from it when it's evicted and therefore populate the full region with PTEs pointing to system pages, you can't go back to huge PMDs again without a munmap() in between. This is all basically magic to me still, but THP does this transformation and I think what it does could work here too. We probably wouldn't be able to upgrade while handling fault, but at the same time, this should be quite rare as it would require the driver to have supplied a small page for this VMA at some point. IIRC THP handles this using khugepaged, grabbing the lock in write mode when coalescing, and yeah, I don't think anything prevents anyone from extending khugepaged doing that also for special huge page table entries. Apart from that I still don't fully get why we need this in the first place. Because virtual huge page address boundaries need to be aligned with physical huge page address boundaries, and mmap can happen before bos are populated so you have no way of knowing how physical huge page address But this is a mmap-time problem, fault can't fix mmap using the wrong VA. Nope. The point here was that in this case, to make sure mmap uses the correct VA to give us a reasonable chance of alignement, the driver might need to be aware of and do trickery with the huge page-table-entry sizes anyway, although I think in most cases a standard helper for this can be supplied. /Thomas I really don't see that either. When a buffer is accessed by the CPU it is in > 90% of all cases completely accessed. Not faulting in full ranges is just optimizing for a really unlikely case here. It might be that you're right, but are all drivers wanting to use this like drm in this respect? Using the interface to fault in a 1G range in the hope it could map it to a huge pud may unexpectedly consume and populate some 16+ MB of page tables. If the underlying device block size is so big then sure, why not? The "unexpectedly" should be quite rare/non existant anyhow. Jason
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/25/21 1:09 PM, Christian König wrote: Am 25.03.21 um 13:01 schrieb Jason Gunthorpe: On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote: Nope. The point here was that in this case, to make sure mmap uses the correct VA to give us a reasonable chance of alignment, the driver might need to be aware of and do trickery with the huge page-table-entry sizes anyway, although I think in most cases a standard helper for this can be supplied. Of course the driver needs some way to influence the VA mmap uses, generally it should align to the natural page size of the device Well a mmap() needs to be aligned to the page size of the CPU, but not necessarily to the one of the device. So I'm pretty sure the device driver should not be involved in any way in the choosing of the VA for the CPU mapping. Christian. We've had this discussion before and at that time I managed to convince you by pointing to the shmem helper for this, shmem_get_unmapped_area(). Basically there are two ways to do this. Either use a standard helper similar to shmem's, and then the driver needs to align physical (device) huge page boundaries to address space offset huge page boundaries. If you don't do that you can just as well use a custom function that adjusts for you not doing that (drm_get_unmapped_area()). Both require driver knowledge of the size of huge pages. Without a function to adjust, mmap will use its default (16 byte?) alignment and the chance of alignment becomes very small. /Thomas Jason
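A sketch of how a driver would wire this up, assuming the drm_get_unmapped_area() helper mentioned above; the my_drv_* names are illustrative and the helper's signature is quoted from memory from drm_file.h of this era, so verify it against the tree:

static unsigned long my_drv_get_unmapped_area(struct file *file,
					      unsigned long uaddr,
					      unsigned long len,
					      unsigned long pgoff,
					      unsigned long flags)
{
	struct my_drv_device *ddev = file_to_my_drv_device(file);

	/* Pick a VA congruent with pgoff modulo the huge page sizes so that
	 * huge PMD/PUD entries become possible at fault time. */
	return drm_get_unmapped_area(file, uaddr, len, pgoff, flags,
				     &ddev->vma_offset_manager);
}

static const struct file_operations my_drv_fops = {
	.owner			= THIS_MODULE,
	.mmap			= my_drv_mmap,
	.get_unmapped_area	= my_drv_get_unmapped_area,
	/* ... */
};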
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
Hi, On 3/25/21 2:02 PM, Christian König wrote: Am 25.03.21 um 13:36 schrieb Thomas Hellström (Intel): On 3/25/21 1:09 PM, Christian König wrote: Am 25.03.21 um 13:01 schrieb Jason Gunthorpe: On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) wrote: Nope. The point here was that in this case, to make sure mmap uses the correct VA to give us a reasonable chance of alignement, the driver might need to be aware of and do trickery with the huge page-table-entry sizes anyway, although I think in most cases a standard helper for this can be supplied. Of course the driver needs some way to influence the VA mmap uses, gernally it should align to the natural page size of the device Well a mmap() needs to be aligned to the page size of the CPU, but not necessarily to the one of the device. So I'm pretty sure the device driver should not be involved in any way the choosing of the VA for the CPU mapping. Christian. We've had this discussion before and at that time I managed to convince you by pointing to the shmem helper for this, shmem_get_umapped_area(). No, you didn't convinced me. I was just surprised that this is something under driver control. Basically there are two ways to do this. Either use a standard helper similar to shmem's, and then the driver needs to align physical (device) huge page boundaries to address space offset huge page boundaries. If you don't do that you can just as well use a custom function that adjusts for you not doing that (drm_get_unmapped_area()). Both require driver knowledge of the size of huge pages. And once more, at least for GPU drivers that looks like the totally wrong approach to me. Aligning the VMA so that huge page allocations become possible is the job of the MM subsystem and not that of the drivers. Previous discussion here https://www.spinics.net/lists/linux-mm/msg205291.html Without a function to adjust, mmap will use it's default (16 byte?) alignment and chance of alignment becomes very small. Well it's 4KiB at least. Yes :/ ... Regards, Christian. Thanks, Thomas
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/24/21 9:25 PM, Dave Hansen wrote: On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote: We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are used. It's quite possible we can encode another use even in the existing bits. Personally, I'd just try: #define _PAGE_BIT_SOFTW5 57 /* available for programmer */ OK, I'll follow your advice here. FWIW I grepped for SW1 and it seems used in a selftest, but only for PTEs AFAICT. Oh, and we don't care about 32-bit much anymore? On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is enabled. IOW, we can handle the majority of 32-bit CPUs out there. But, yeah, we don't care about 32-bit. :) Hmm, actually it makes some sense to use SW1, to make it end up in the same dword as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit PAE is not atomic, so in theory a huge pmd could be modified while reading the pmd_t, making the dwords inconsistent. How does that work with fast gup anyway? In any case, what would be the best course of action here? Use SW1 or disable completely for 32-bit? /Thomas
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/25/21 6:55 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 9:25 PM, Dave Hansen wrote: On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote: We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are used. It's quite possible we can encode another use even in the existing bits. Personally, I'd just try: #define _PAGE_BIT_SOFTW5 57 /* available for programmer */ OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems used in a selftest, but only for PTEs AFAICT. Oh, and we don't care about 32-bit much anymore? On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is enabled. IOW, we can handle the majority of 32-bit CPUs out there. But, yeah, we don't care about 32-bit. :) Hmm, Actually it makes some sense to use SW1, to make it end up in the same dword as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit PAE is not atomic, so in theory a huge pmd could be modified while reading the pmd_t making the dwords inconsistent How does that work with fast gup anyway? It loops to get an atomic 64 bit value if the arch can't provide an atomic 64 bit load Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd is dereferenced either in try_grab_compound_head() or __gup_device_huge(), before the pmd is compared to the value the pointer is currently pointing to. Couldn't those dereferences be on invalid pointers? /Thomas Jason
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/25/21 7:24 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote: On 3/25/21 6:55 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 9:25 PM, Dave Hansen wrote: On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote: We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are used. It's quite possible we can encode another use even in the existing bits. Personally, I'd just try: #define _PAGE_BIT_SOFTW5 57 /* available for programmer */ OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems used in a selftest, but only for PTEs AFAICT. Oh, and we don't care about 32-bit much anymore? On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is enabled. IOW, we can handle the majority of 32-bit CPUs out there. But, yeah, we don't care about 32-bit. :) Hmm, Actually it makes some sense to use SW1, to make it end up in the same dword as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit PAE is not atomic, so in theory a huge pmd could be modified while reading the pmd_t making the dwords inconsistent How does that work with fast gup anyway? It loops to get an atomic 64 bit value if the arch can't provide an atomic 64 bit load Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd is dereferenced either in try_grab_compound_head() or __gup_device_huge(), before the pmd is compared to the value the pointer is currently pointing to. Couldn't those dereferences be on invalid pointers? Uh.. That does look questionable, yes. Unless there is some tricky reason why a 64 bit pmd entry on a 32 bit arch either can't exist or has a stable upper 32 bits.. The pte does it with ptep_get_lockless(), we probably need the same for the other levels too instead of open coding a READ_ONCE? Jason Yes, unless that comment before local_irq_disable() means some magic is done to prevent bad things happening, but I guess if it's needed for ptes, it's probably needed for pmds and puds as well. /Thomas
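What such a helper could look like, modelled on the 32-bit PAE PTE variant (re-read the low half until it is stable). This is a sketch only: no pmdp_get_lockless() existed at the time of this thread, the function name is made up, and the layout assumption (little-endian, two 32-bit halves) is x86 PAE specific:

static inline pmd_t pmdp_get_lockless_sketch(pmd_t *pmdp)
{
	u32 *p = (u32 *)pmdp;
	u32 lo, hi;

	do {
		lo = READ_ONCE(p[0]);	/* low half holds present/PSE bits */
		smp_rmb();
		hi = READ_ONCE(p[1]);
		smp_rmb();
	} while (unlikely(lo != READ_ONCE(p[0])));

	return __pmd(((u64)hi << 32) | lo);
}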
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/23/21 2:52 PM, Jason Gunthorpe wrote: On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote: diff --git a/mm/gup.c b/mm/gup.c index e40579624f10..1b6a127f0bdd 100644 +++ b/mm/gup.c @@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, } #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL +/* + * If we can't determine whether or not a pte is special, then fail immediately + * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be + * special. For THP, special huge entries are indicated by xxx_devmap() + * returning true, but a corresponding call to get_dev_pagemap() will + * return NULL. + * + * For a futex to be placed on a THP tail page, get_futex_key requires a + * get_user_pages_fast_only implementation that can pin pages. Thus it's still + * useful to have gup_huge_pmd even if we can't operate on ptes. + */ Why move this comment? I think it was correct where it was Yes, you're right. I misread it to refer to the actual code in the gup_pte_range function rather than to the empty version. I'll move it back. /Thomas Jason
Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
On 3/23/21 3:00 PM, Jason Gunthorpe wrote: On Sun, Mar 21, 2021 at 07:45:29PM +0100, Thomas Hellström (Intel) wrote: To block fast gup we need to make sure TTM ptes are always special. With MIXEDMAP we, on architectures that don't support pte_special, insert normal ptes, but OTOH on those architectures, fast is not supported. At the same time, the function documentation to vm_normal_page() suggests that ptes pointing to system memory pages of MIXEDMAP vmas are always normal, but that doesn't seem consistent with what's implemented in vmf_insert_mixed(). I'm thus not entirely sure this patch is actually needed. But to make sure and to avoid also normal (non-fast) gup, make all TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings anymore so make is_cow_mapping() available and use it to reject COW mappigs at mmap time. There was previously a comment in the code that WC mappings together with x86 PAT + PFNMAP was bad for performance. However from looking at vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP are handled the same for architectures that support pte_special. This means there should not be a performance difference anymore, but this needs to be verified. Cc: Christian Koenig Cc: David Airlie Cc: Daniel Vetter Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: dri-de...@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Thomas Hellström (Intel) drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 -- include/linux/mm.h | 5 + mm/internal.h | 5 - 3 files changed, 13 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c index 1c34983480e5..708c6fb9be81 100644 +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf, * at arbitrary times while the data is mmap'ed. * See vmf_insert_mixed_prot() for a discussion. */ - if (vma->vm_flags & VM_MIXEDMAP) - ret = vmf_insert_mixed_prot(vma, address, - __pfn_to_pfn_t(pfn, PFN_DEV), - prot); - else - ret = vmf_insert_pfn_prot(vma, address, pfn, prot); + ret = vmf_insert_pfn_prot(vma, address, pfn, prot); /* Never error on prefaulted PTEs */ if (unlikely((ret & VM_FAULT_ERROR))) { @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s * Note: We're transferring the bo reference to * vma->vm_private_data here. */ - vma->vm_private_data = bo; /* -* We'd like to use VM_PFNMAP on shared mappings, where -* (vma->vm_flags & VM_SHARED) != 0, for performance reasons, -* but for some reason VM_PFNMAP + x86 PAT + write-combine is very -* bad for performance. Until that has been sorted out, use -* VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719 +* PFNMAP forces us to block COW mappings in mmap(), +* and with MIXEDMAP we would incorrectly allow fast gup +* on TTM memory on architectures that don't have pte_special. 
*/ - vma->vm_flags |= VM_MIXEDMAP; - vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP; } int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma, @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma, if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START)) return -EINVAL; + if (unlikely(is_cow_mapping(vma->vm_flags))) + return -EINVAL; + bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma)); if (unlikely(!bo)) return -EINVAL; diff --git a/include/linux/mm.h b/include/linux/mm.h index 77e64e3eac80..c6ebf7f9ddbb 100644 +++ b/include/linux/mm.h @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma) return vma->vm_flags & VM_ACCESS_FLAGS; } +static inline bool is_cow_mapping(vm_flags_t flags) +{ + return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; +} Most driver places are just banning VM_SHARED. I see you copied this from remap_pfn_range(), but that logic is so special I'm not sure.. It's actually used all over the place. Both in drivers and also redefined with CONFIG_MEM_SOFT_DIRTY which makes me think Daniels idea of vma_is_cow_mapping() is better since it won't clash and cause compilation failures... Can the user mprotect the write back on with the above logic? No, it's blocked by mprotect. Do we n
Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
On 3/23/21 3:04 PM, Jason Gunthorpe wrote: On Tue, Mar 23, 2021 at 12:47:24PM +0100, Daniel Vetter wrote: +static inline bool is_cow_mapping(vm_flags_t flags) Bit a bikeshed, but I wonder whether the public interface shouldn't be vma_is_cow_mapping. Or whether this shouldn't be rejected somewhere else, since at least in drivers/gpu we have tons of cases that don't check for this and get it all kinds of wrong I think. remap_pfn_range handles this for many cases, but by far not for all. Anyway patch itself lgtm: Reviewed-by: Daniel Vetter I would like it if io_remap_pfn_range() did not allow shared mappings at all. You mean private mappings? IIRC it doesn't work anyway, the kernel can't reliably copy from IO pages eg the "_copy_from_user_inatomic()" under cow_user_page() will not work on s390 that requires all IO memory be accessed with special instructions. Unfortunately I have no idea what the long ago special case of allowing COW'd IO mappings is. :\ Me neither, but at some point it must have been important enough to introduce VM_MIXEDMAP... /Thomas Jason
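For reference, the vma_is_cow_mapping() wrapper floated above would be trivial; the sketch below is illustrative only, not what the patch adds. A mapping is COW when it is private (VM_SHARED clear) but may be written (VM_MAYWRITE set):

static inline bool vma_is_cow_mapping(const struct vm_area_struct *vma)
{
	return (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

The mmap-time rejection in ttm_bo_mmap() would then test vma_is_cow_mapping(vma) instead of is_cow_mapping(vma->vm_flags).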
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
Hi, On 3/23/21 12:34 PM, Daniel Vetter wrote: On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote: TTM sets up huge page-table-entries both to system- and device memory, and we don't want gup to assume there are always valid backing struct pages for these. For PTEs this is handled by setting the pte_special bit, but for the huge PUDs and PMDs, we have neither pmd_special nor pud_special. Normally, huge TTM entries are identified by looking at vma_is_special_huge(), but fast gup can't do that, so as an alternative define _devmap entries for which there are no backing dev_pagemap as special, update documentation and make huge TTM entries _devmap, after verifying that there is no backing dev_pagemap. One other alternative would be to block TTM huge page-table-entries completely, and while currently only vmwgfx use them, they would be beneficial to other graphis drivers moving forward as well. Cc: Christian Koenig Cc: David Airlie Cc: Daniel Vetter Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: dri-de...@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Thomas Hellström (Intel) --- drivers/gpu/drm/ttm/ttm_bo_vm.c | 17 - mm/gup.c| 21 +++-- mm/memremap.c | 5 + 3 files changed, 32 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c index 6dc96cf66744..1c34983480e5 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, pfn_t pfnt; struct ttm_tt *ttm = bo->ttm; bool write = vmf->flags & FAULT_FLAG_WRITE; + struct dev_pagemap *pagemap; /* Fault should not cross bo boundary. */ page_offset &= ~(fault_page_size - 1); @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, if ((pfn & (fault_page_size - 1)) != 0) goto out_fallback; + /* +* Huge entries must be special, that is marking them as devmap +* with no backing device map range. If there is a backing +* range, Don't insert a huge entry. +* If this check turns out to be too much of a performance hit, +* we can instead have drivers indicate whether they may have +* backing device map ranges and if not, skip this lookup. +*/ I think we can do this statically: - if it's system memory we know there's no devmap for it, and we do the trick to block gup_fast Yes, that should work. - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV, so might as well not do that I think gup_fast will unfortunately mistake a huge iomem page for an ordinary page and try to access a non-existant struct page for it, unless we do the devmap trick. And the lookup would then be for the rare case where a driver would have already registered a dev_pagemap for an iomem area which may also be mapped through TTM (like the patch from Felix a couple of weeks ago). If a driver can promise not to do that, then we can safely remove the lookup. /Thomas
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/23/21 5:37 PM, Jason Gunthorpe wrote: On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote: @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, if ((pfn & (fault_page_size - 1)) != 0) goto out_fallback; + /* +* Huge entries must be special, that is marking them as devmap +* with no backing device map range. If there is a backing +* range, Don't insert a huge entry. +* If this check turns out to be too much of a performance hit, +* we can instead have drivers indicate whether they may have +* backing device map ranges and if not, skip this lookup. +*/ I think we can do this statically: - if it's system memory we know there's no devmap for it, and we do the trick to block gup_fast Yes, that should work. - if it's iomem, we know gup_fast wont work anyway if don't set PFN_DEV, so might as well not do that I think gup_fast will unfortunately mistake a huge iomem page for an ordinary page and try to access a non-existant struct page for it, unless we do the devmap trick. And the lookup would then be for the rare case where a driver would have already registered a dev_pagemap for an iomem area which may also be mapped through TTM (like the patch from Felix a couple of weeks ago). If a driver can promise not to do that, then we can safely remove the lookup. Isn't the devmap PTE flag arch optional? Does this fall back to not using huge pages on arches that don't support it? Good point. No, currently it's only conditioned on transhuge page support. Need to condition it on also devmap support. Also, I feel like this code to install "pte_special" huge pages does not belong in the drm subsystem.. I could add helpers in huge_memory.c: vmf_insert_pfn_pmd_prot_special() and vmf_insert_pfn_pud_prot_special() /Thomas Jason
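Sketched as declarations only, the proposed helpers would mirror the existing vmf_insert_pfn_pmd_prot()/vmf_insert_pfn_pud_prot() in include/linux/huge_mm.h but mark the resulting entry special instead of devmap; they were never merged in this form:

/* Hypothetical -- proposed in this thread, not an existing API. */
vm_fault_t vmf_insert_pfn_pmd_prot_special(struct vm_fault *vmf, pfn_t pfn,
					   pgprot_t pgprot, bool write);
vm_fault_t vmf_insert_pfn_pud_prot_special(struct vm_fault *vmf, pfn_t pfn,
					   pgprot_t pgprot, bool write);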
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/23/21 8:52 PM, Williams, Dan J wrote: On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote: TTM sets up huge page-table-entries both to system- and device memory, and we don't want gup to assume there are always valid backing struct pages for these. For PTEs this is handled by setting the pte_special bit, but for the huge PUDs and PMDs, we have neither pmd_special nor pud_special. Normally, huge TTM entries are identified by looking at vma_is_special_huge(), but fast gup can't do that, so as an alternative define _devmap entries for which there are no backing dev_pagemap as special, update documentation and make huge TTM entries _devmap, after verifying that there is no backing dev_pagemap. Please do not abuse p{m,u}d_devmap like this. I'm in the process of removing get_devpagemap() from the gup-fast path [1]. Instead there should be space for p{m,u}d_special in the page table entries (at least for x86-64). So the fix is to remove that old assumption that huge pages can never be special. [1]: http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com Hmm, yes with that patch it will obviously not work as intended. Given that, I think we'll need to disable the TTM huge pages for now until we can sort out and agree on using a page table entry bit. Thanks, /Thomas
[RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
To block fast gup we need to make sure TTM ptes are always special. With MIXEDMAP we, on architectures that don't support pte_special, insert normal ptes, but OTOH on those architectures, fast is not supported. At the same time, the function documentation to vm_normal_page() suggests that ptes pointing to system memory pages of MIXEDMAP vmas are always normal, but that doesn't seem consistent with what's implemented in vmf_insert_mixed(). I'm thus not entirely sure this patch is actually needed. But to make sure and to avoid also normal (non-fast) gup, make all TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings anymore so make is_cow_mapping() available and use it to reject COW mappigs at mmap time. There was previously a comment in the code that WC mappings together with x86 PAT + PFNMAP was bad for performance. However from looking at vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP are handled the same for architectures that support pte_special. This means there should not be a performance difference anymore, but this needs to be verified. Cc: Christian Koenig Cc: David Airlie Cc: Daniel Vetter Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: dri-de...@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Thomas Hellström (Intel) --- drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 -- include/linux/mm.h | 5 + mm/internal.h | 5 - 3 files changed, 13 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c index 1c34983480e5..708c6fb9be81 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf, * at arbitrary times while the data is mmap'ed. * See vmf_insert_mixed_prot() for a discussion. */ - if (vma->vm_flags & VM_MIXEDMAP) - ret = vmf_insert_mixed_prot(vma, address, - __pfn_to_pfn_t(pfn, PFN_DEV), - prot); - else - ret = vmf_insert_pfn_prot(vma, address, pfn, prot); + ret = vmf_insert_pfn_prot(vma, address, pfn, prot); /* Never error on prefaulted PTEs */ if (unlikely((ret & VM_FAULT_ERROR))) { @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s * Note: We're transferring the bo reference to * vma->vm_private_data here. */ - vma->vm_private_data = bo; /* -* We'd like to use VM_PFNMAP on shared mappings, where -* (vma->vm_flags & VM_SHARED) != 0, for performance reasons, -* but for some reason VM_PFNMAP + x86 PAT + write-combine is very -* bad for performance. Until that has been sorted out, use -* VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719 +* PFNMAP forces us to block COW mappings in mmap(), +* and with MIXEDMAP we would incorrectly allow fast gup +* on TTM memory on architectures that don't have pte_special. 
*/ - vma->vm_flags |= VM_MIXEDMAP; - vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP; } int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma, @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma, if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START)) return -EINVAL; + if (unlikely(is_cow_mapping(vma->vm_flags))) + return -EINVAL; + bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma)); if (unlikely(!bo)) return -EINVAL; diff --git a/include/linux/mm.h b/include/linux/mm.h index 77e64e3eac80..c6ebf7f9ddbb 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma) return vma->vm_flags & VM_ACCESS_FLAGS; } +static inline bool is_cow_mapping(vm_flags_t flags) +{ + return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; +} + #ifdef CONFIG_SHMEM /* * The vma_is_shmem is not inline because it is used only by slow diff --git a/mm/internal.h b/mm/internal.h index 9902648f2206..1432feec62df 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct page *page) */ #define buddy_order_unsafe(page) READ_ONCE(page_private(page)) -static inline bool is_cow_mapping(vm_flags_t flags) -{ - return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; -} - /* * These three helpers classifies VMAs for virtual memory accounting. */ -- 2.30.2
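From the userspace side, the new is_cow_mapping() check in ttm_bo_mmap() means MAP_PRIVATE mappings of a TTM device file are now refused with -EINVAL (a private file mapping always carries VM_MAYWRITE but never VM_SHARED), while MAP_SHARED keeps working. An illustrative snippet, assuming ttm_fd is a DRM/TTM file descriptor:

#include <sys/mman.h>

static void *map_bo(int ttm_fd, size_t size, off_t offset)
{
	/* Rejected after this patch (private mapping => COW): */
	/* mmap(NULL, size, PROT_READ, MAP_PRIVATE, ttm_fd, offset); */

	/* Still supported: */
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    ttm_fd, offset);
}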
[RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
TTM sets up huge page-table-entries both to system- and device memory, and we don't want gup to assume there are always valid backing struct pages for these. For PTEs this is handled by setting the pte_special bit, but for the huge PUDs and PMDs, we have neither pmd_special nor pud_special. Normally, huge TTM entries are identified by looking at vma_is_special_huge(), but fast gup can't do that, so as an alternative define _devmap entries for which there are no backing dev_pagemap as special, update documentation and make huge TTM entries _devmap, after verifying that there is no backing dev_pagemap. One other alternative would be to block TTM huge page-table-entries completely, and while currently only vmwgfx use them, they would be beneficial to other graphis drivers moving forward as well. Cc: Christian Koenig Cc: David Airlie Cc: Daniel Vetter Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: dri-de...@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Thomas Hellström (Intel) --- drivers/gpu/drm/ttm/ttm_bo_vm.c | 17 - mm/gup.c| 21 +++-- mm/memremap.c | 5 + 3 files changed, 32 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c index 6dc96cf66744..1c34983480e5 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, pfn_t pfnt; struct ttm_tt *ttm = bo->ttm; bool write = vmf->flags & FAULT_FLAG_WRITE; + struct dev_pagemap *pagemap; /* Fault should not cross bo boundary. */ page_offset &= ~(fault_page_size - 1); @@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, if ((pfn & (fault_page_size - 1)) != 0) goto out_fallback; + /* +* Huge entries must be special, that is marking them as devmap +* with no backing device map range. If there is a backing +* range, Don't insert a huge entry. +* If this check turns out to be too much of a performance hit, +* we can instead have drivers indicate whether they may have +* backing device map ranges and if not, skip this lookup. +*/ + pagemap = get_dev_pagemap(pfn, NULL); + if (pagemap) { + put_dev_pagemap(pagemap); + goto out_fallback; + } + /* Check that memory is contiguous. */ if (!bo->mem.bus.is_iomem) { for (i = 1; i < fault_page_size; ++i) { @@ -223,7 +238,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf, } } - pfnt = __pfn_to_pfn_t(pfn, PFN_DEV); + pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP); if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT)) ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write); #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD diff --git a/mm/gup.c b/mm/gup.c index e40579624f10..1b6a127f0bdd 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, } #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL +/* + * If we can't determine whether or not a pte is special, then fail immediately + * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be + * special. For THP, special huge entries are indicated by xxx_devmap() + * returning true, but a corresponding call to get_dev_pagemap() will + * return NULL. + * + * For a futex to be placed on a THP tail page, get_futex_key requires a + * get_user_pages_fast_only implementation that can pin pages. Thus it's still + * useful to have gup_huge_pmd even if we can't operate on ptes. 
+ */ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { @@ -2069,16 +2080,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, return ret; } #else - -/* - * If we can't determine whether or not a pte is special, then fail immediately - * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not - * to be special. - * - * For a futex to be placed on a THP tail page, get_futex_key requires a - * get_user_pages_fast_only implementation that can pin pages. Thus it's still - * useful to have gup_huge_pmd even if we can't operate on ptes. - */ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { diff --git a/mm/memremap.c b/mm/memremap.c index 7aa7d6e80ee5..757551cd2a4d 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -471,6 +471,11 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
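For readers less familiar with the gup-fast side, here is a rough sketch of the idea the patch relies on: a huge entry marked _devmap whose pfn has no registered dev_pagemap is treated as special, and fast gup backs off to the slow path. This is a simplified illustration with an invented function name, not the actual mm/gup.c code.

/*
 * Hedged sketch, not the real __gup_device_huge(): a devmap-marked huge
 * entry without a backing dev_pagemap has no struct pages to pin, so
 * fast gup must bail out (return 0 == fall back to the slow path).
 */
static int gup_huge_devmap_sketch(unsigned long pfn, struct page **pages, int *nr)
{
        struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);

        if (!pgmap)
                return 0;       /* special huge TTM entry: nothing to pin */

        /* ... otherwise pin the pages backed by pgmap as usual ... */

        put_dev_pagemap(pgmap);
        return 1;
}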
[RFC PATCH 0/2] mm,drm/ttm: Always block GUP to TTM pages
get_user_pages() to TTM pages is unwanted since TTM assumes it owns the pages exclusively and / or sets up page-table mappings to io memory. The first patch makes sure we stop fast gup to huge TTM pages using a trick with pmd_devmap() and pud_devmap() without a backing dev_pagemap. The second patch makes sure we block normal gup by setting VM_PFNMAP. Cc: Christian Koenig Cc: David Airlie Cc: Daniel Vetter Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: dri-de...@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Thomas Hellström (Intel)
Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas
Hi! On 3/22/21 8:47 AM, Christian König wrote: Am 21.03.21 um 19:45 schrieb Thomas Hellström (Intel): To block fast gup we need to make sure TTM ptes are always special. With MIXEDMAP we, on architectures that don't support pte_special, insert normal ptes, but OTOH on those architectures, fast is not supported. At the same time, the function documentation to vm_normal_page() suggests that ptes pointing to system memory pages of MIXEDMAP vmas are always normal, but that doesn't seem consistent with what's implemented in vmf_insert_mixed(). I'm thus not entirely sure this patch is actually needed. But to make sure and to avoid also normal (non-fast) gup, make all TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings anymore so make is_cow_mapping() available and use it to reject COW mappigs at mmap time. I would separate the disallowing of COW mapping from the PFN change. I'm pretty sure that COW mappings never worked on TTM BOs in the first place. COW doesn't work with PFNMAP together with non-linear maps, so as a consequence from moving from MIXEDMAP to PFNMAP we must disallow COW, so it seems logical to me to do it in one patch. And working COW was one of the tests I used for huge PMDs/PUDs, so it has indeed been working, but I can't think of any relevant use-cases. Did you, BTW, have a chance to test this with WC mappings? Thanks, /Thomas But either way this patch is Reviewed-by: Christian König . Thanks, Christian. There was previously a comment in the code that WC mappings together with x86 PAT + PFNMAP was bad for performance. However from looking at vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP are handled the same for architectures that support pte_special. This means there should not be a performance difference anymore, but this needs to be verified. Cc: Christian Koenig Cc: David Airlie Cc: Daniel Vetter Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: dri-de...@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Thomas Hellström (Intel) --- drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 -- include/linux/mm.h | 5 + mm/internal.h | 5 - 3 files changed, 13 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c index 1c34983480e5..708c6fb9be81 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c @@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf, * at arbitrary times while the data is mmap'ed. * See vmf_insert_mixed_prot() for a discussion. */ - if (vma->vm_flags & VM_MIXEDMAP) - ret = vmf_insert_mixed_prot(vma, address, - __pfn_to_pfn_t(pfn, PFN_DEV), - prot); - else - ret = vmf_insert_pfn_prot(vma, address, pfn, prot); + ret = vmf_insert_pfn_prot(vma, address, pfn, prot); /* Never error on prefaulted PTEs */ if (unlikely((ret & VM_FAULT_ERROR))) { @@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct ttm_buffer_object *bo, struct vm_area_s * Note: We're transferring the bo reference to * vma->vm_private_data here. */ - vma->vm_private_data = bo; /* - * We'd like to use VM_PFNMAP on shared mappings, where - * (vma->vm_flags & VM_SHARED) != 0, for performance reasons, - * but for some reason VM_PFNMAP + x86 PAT + write-combine is very - * bad for performance. Until that has been sorted out, use - * VM_MIXEDMAP on all mappings. 
See freedesktop.org bug #75719 + * PFNMAP forces us to block COW mappings in mmap(), + * and with MIXEDMAP we would incorrectly allow fast gup + * on TTM memory on architectures that don't have pte_special. */ - vma->vm_flags |= VM_MIXEDMAP; - vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP; } int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma, @@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma, if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START)) return -EINVAL; + if (unlikely(is_cow_mapping(vma->vm_flags))) + return -EINVAL; + bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma)); if (unlikely(!bo)) return -EINVAL; diff --git a/include/linux/mm.h b/include/linux/mm.h index 77e64e3eac80..c6ebf7f9ddbb 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma) return vma->vm_flags & VM_ACCESS_FLAGS; } +static inline bool is_cow_mapping(vm_flags_t flags) +{ + return (flags
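To make the mmap-time rejection concrete, a minimal userspace sketch follows. The device node and buffer-object mmap offset below are placeholders (a real application obtains the offset from its driver), but it shows the intended behaviour: a MAP_PRIVATE (COW) mapping of a TTM BO is refused with EINVAL, while MAP_SHARED keeps working.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/dri/card0", O_RDWR);        /* hypothetical node */
        off_t bo_offset = 0x10000000;                   /* placeholder offset */
        size_t len = 4096;
        void *p;

        if (fd < 0)
                return 1;

        /* Private mapping would be COW: expected to fail with EINVAL now */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, bo_offset);
        if (p == MAP_FAILED)
                perror("mmap(MAP_PRIVATE)");

        /* Shared mappings remain the supported way to map a BO */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, bo_offset);
        if (p != MAP_FAILED)
                munmap(p, len);

        close(fd);
        return 0;
}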
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/25/21 7:24 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote: On 3/25/21 6:55 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 9:25 PM, Dave Hansen wrote: On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote: We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are used. It's quite possible we can encode another use even in the existing bits. Personally, I'd just try: #define _PAGE_BIT_SOFTW5 57 /* available for programmer */ OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems used in a selftest, but only for PTEs AFAICT. Oh, and we don't care about 32-bit much anymore? On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is enabled. IOW, we can handle the majority of 32-bit CPUs out there. But, yeah, we don't care about 32-bit. :) Hmm, Actually it makes some sense to use SW1, to make it end up in the same dword as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit PAE is not atomic, so in theory a huge pmd could be modified while reading the pmd_t making the dwords inconsistent How does that work with fast gup anyway? It loops to get an atomic 64 bit value if the arch can't provide an atomic 64 bit load Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd is dereferenced either in try_grab_compound_head() or __gup_device_huge(), before the pmd is compared to the value the pointer is currently pointing to. Couldn't those dereferences be on invalid pointers? Uh.. That does look questionable, yes. Unless there is some tricky reason why a 64 bit pmd entry on a 32 bit arch either can't exist or has a stable upper 32 bits.. The pte does it with ptep_get_lockless(), we probably need the same for the other levels too instead of open coding a READ_ONCE? Jason TBH, ptep_get_lockless() also looks a bit fishy. it says "it will not switch to a completely different present page without a TLB flush in between". What if the following happens: processor 1: Reads lower dword of PTE. processor 2: Zaps PTE. Gets stuck waiting to do TLB flush processor 1: Reads upper dword of PTE, which is now zero. processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a new PTE value which happens to be the same as the original one before the zap. processor 1: Reads the newly faulted in lower dword, compares to the old one, gives an OK and returns a bogus PTE. /Thomas
Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
On 3/26/21 12:46 PM, Jason Gunthorpe wrote: On Fri, Mar 26, 2021 at 10:08:09AM +0100, Thomas Hellström (Intel) wrote: On 3/25/21 7:24 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote: On 3/25/21 6:55 PM, Jason Gunthorpe wrote: On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote: On 3/24/21 9:25 PM, Dave Hansen wrote: On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote: We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are used. It's quite possible we can encode another use even in the existing bits. Personally, I'd just try: #define _PAGE_BIT_SOFTW5 57 /* available for programmer */ OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems used in a selftest, but only for PTEs AFAICT. Oh, and we don't care about 32-bit much anymore? On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is enabled. IOW, we can handle the majority of 32-bit CPUs out there. But, yeah, we don't care about 32-bit. :) Hmm, Actually it makes some sense to use SW1, to make it end up in the same dword as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit PAE is not atomic, so in theory a huge pmd could be modified while reading the pmd_t making the dwords inconsistent How does that work with fast gup anyway? It loops to get an atomic 64 bit value if the arch can't provide an atomic 64 bit load Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd is dereferenced either in try_grab_compound_head() or __gup_device_huge(), before the pmd is compared to the value the pointer is currently pointing to. Couldn't those dereferences be on invalid pointers? Uh.. That does look questionable, yes. Unless there is some tricky reason why a 64 bit pmd entry on a 32 bit arch either can't exist or has a stable upper 32 bits.. The pte does it with ptep_get_lockless(), we probably need the same for the other levels too instead of open coding a READ_ONCE? Jason TBH, ptep_get_lockless() also looks a bit fishy. it says "it will not switch to a completely different present page without a TLB flush in between". What if the following happens: processor 1: Reads lower dword of PTE. processor 2: Zaps PTE. Gets stuck waiting to do TLB flush processor 1: Reads upper dword of PTE, which is now zero. processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a new PTE value which happens to be the same as the original one before the zap. processor 1: Reads the newly faulted in lower dword, compares to the old one, gives an OK and returns a bogus PTE. So you are saying that while the zap will wait for the TLB flush to globally finish once it gets started any other processor can still write to the pte? I can't think of any serialization that would cause fault to wait for the zap/TLB flush, especially if the zap comes from the address_space and doesn't hold the mmap lock. I might of course be completely wrong, but It seems there is an assumption made that all potentially affected processors would have a valid TLB entry for the PTE. Then the fault would not happen (well unless of course the TLB flush completes on some processors before getting stuck on the local_irq_disable() on processor 1). +CC: Nick Piggin Seems like Nick Piggin is the original author of the comment. Perhaps he can can clarify a bit. /Thomas Seems worth bringing up in a bigger thread, maybe someone else knows? Jason
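For reference, a stripped-down sketch of the retry pattern under discussion, modeled on the PAE flavour of ptep_get_lockless(); the struct layout and barrier below are simplified stand-ins, and as noted above the thread questions whether this pattern alone is sufficient.

#include <stdint.h>

/* Sketch only: read a 64-bit entry as two 32-bit halves and retry until
 * the low half is stable. Simplified types; __sync_synchronize() stands
 * in for smp_rmb().
 */
struct pxd64 {
        volatile uint32_t low;
        volatile uint32_t high;
};

static inline uint64_t pxd_get_lockless(const struct pxd64 *p)
{
        uint32_t low, high;

        do {
                low = p->low;
                __sync_synchronize();
                high = p->high;
                __sync_synchronize();
        } while (low != p->low);        /* low half changed: retry */

        return ((uint64_t)high << 32) | low;
}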
Re: [PATCH 01/18] mm: Track mmu notifiers in fs_reclaim_acquire/release
Hi, Daniel, Please see below. On 6/4/20 10:12 AM, Daniel Vetter wrote: fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end"). But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion. I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case. Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, and these two contexts aren't the same. Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map. With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful. Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: linux-r...@vger.kernel.org Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter --- This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel --- mm/mmu_notifier.c | 7 --- mm/page_alloc.c | 23 ++- 2 files changed, 14 insertions(+), 16 deletions(-) diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 06852b896fa6..5d578b9122f8 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, lockdep_assert_held_write(&mm->mmap_sem); BUG_ON(atomic_read(&mm->mm_users) <= 0); - if (IS_ENABLED(CONFIG_LOCKDEP)) { - fs_reclaim_acquire(GFP_KERNEL); - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); - lock_map_release(&__mmu_notifier_invalidate_range_start_map); - fs_reclaim_release(GFP_KERNEL); - } - if (!mm->notifier_subscriptions) { /* * kmalloc cannot be called under mm_take_all_locks(), but we diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..f8a222db4a53 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include #include @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla static struct lockdep_map __fs_reclaim_map = STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map); -static bool __need_fs_reclaim(gfp_t gfp_mask) +static bool __need_reclaim(gfp_t gfp_mask) { gfp_mask = current_gfp_context(gfp_mask); @@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) if (current->flags & PF_MEMALLOC) return false; - /* We're only interested __GFP_FS allocations for now */ - if (!(gfp_mask & __GFP_FS)) - return false; - if (gfp_mask & __GFP_NOLOCKDEP) return false; @@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void) void fs_reclaim_acquire(gfp_t gfp_mask) { - if (__need_fs_reclaim(gfp_mask)) - __fs_reclaim_acquire(); + if (__need_reclaim(gfp_mask)) { + if (!(gfp_mask & __GFP_FS)) Hmm. Shouldn't this be "if (gfp_mask & __GFP_FS)" or am I misunderstanding? + __fs_reclaim_acquire(); #ifdef CONFIG_MMU_NOTIFIER? + + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); + lock_map_release(&__mmu_notifier_invalidate_range_start_map); + + } } EXPORT_SYMBOL_GPL(fs_reclaim_acquire); void fs_reclaim_release(gfp_t gfp_mask) { -
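Putting the two review comments above together, here is a hedged sketch of the control flow they appear to ask for: test __GFP_FS the other way around, and guard the mmu notifier map with CONFIG_MMU_NOTIFIER. This is one reading of the review, not the posted hunk.

void fs_reclaim_acquire(gfp_t gfp_mask)
{
        if (__need_reclaim(gfp_mask)) {
                /* only the fs_reclaim map itself depends on __GFP_FS */
                if (gfp_mask & __GFP_FS)
                        __fs_reclaim_acquire();

#ifdef CONFIG_MMU_NOTIFIER
                lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
                lock_map_release(&__mmu_notifier_invalidate_range_start_map);
#endif
        }
}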
Re: [PATCH 02/18] dma-buf: minor doc touch-ups
On 6/4/20 10:12 AM, Daniel Vetter wrote: Just some tiny edits: - fix link to struct dma_fence - give slightly more meaningful title - the polling here is about implicit fences, explicit fences (in sync_file or drm_syncobj) also have their own polling Signed-off-by: Daniel Vetter Reviewed-by: Thomas Hellstrom
Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations
On 6/4/20 10:12 AM, Daniel Vetter wrote: Two in one go: - it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required. - it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey. Consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also by implication of dma_resv_lock(), no userspace faulting allowed. That's some supremely obnoxious limitations, which is why we need to sprinkle the right annotations to all relevant paths. The one big locking context we're leaving out here is mmu notifiers, added in commit 23b68395c7c78a764e8963fc15a7cfd318bf187f Author: Daniel Vetter Date: Mon Aug 26 22:14:21 2019 +0200 mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end that one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there's no ready-made functions exposed to prime this, so I've left it out for now. v2: Also track against mmu notifier context. v3: kerneldoc to spec the cross-driver contract. Note that currently i915 throws in a hard-coded 10s timeout on foreign fences (not sure why that was done, but it's there), which is why that rule is worded with SHOULD instead of MUST. Also some of the mmu_notifier/shrinker rules might surprise SoC drivers, I haven't fully audited them all. Which is infeasible anyway, we'll need to run them with lockdep and dma-fence annotations and see what goes boom. v4: A spelling fix from Mika Cc: Mika Kuoppala Cc: Thomas Hellstrom Cc: linux-me...@vger.kernel.org Cc: linaro-mm-...@lists.linaro.org Cc: linux-r...@vger.kernel.org Cc: amd-...@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: Chris Wilson Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter --- Documentation/driver-api/dma-buf.rst | 6 drivers/dma-buf/dma-fence.c | 41 drivers/dma-buf/dma-resv.c | 4 +++ include/linux/dma-fence.h| 1 + 4 files changed, 52 insertions(+) I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that). But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this, Reviewed-by: Thomas Hellström
Re: [RFC 02/17] dma-fence: basic lockdep annotations
On 2020-05-12 10:59, Daniel Vetter wrote: Design is similar to the lockdep annotations for workers, but with some twists: - We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only. - We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra Date: Wed Aug 23 13:13:11 2017 +0200 locking/lockdep/selftests: Add mixed read-write ABBA tests - To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that. - The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default. - To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be spinlocks. That bollocks. The approach now is to simple check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts. The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code. The main class of deadlocks this is supposed to catch are: Thread A: mutex_lock(A); mutex_unlock(A); dma_fence_signal(); Thread B: mutex_lock(A); dma_fence_wait(); mutex_unlock(A); Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A. Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives. v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise. 
Cc: linux-me...@vger.kernel.org Cc: linaro-mm-...@lists.linaro.org Cc: linux-r...@vger.kernel.org Cc: amd-...@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: Chris Wilson Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter LGTM. Perhaps add some in-code documentation on how the new functions are meant to be used. Otherwise, for patches 2 and 3, Reviewed-by: Thomas Hellstrom
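Along the lines of that request, a hedged usage sketch of the new annotations: the job structure and the submission/completion hooks are invented for the example, only the dma_fence_*() calls are the API being added.

#include <linux/dma-fence.h>

/* Invented driver context, for illustration only */
struct example_job {
        struct dma_fence *fence;
};

static void example_submit(struct example_job *job)
{
        bool cookie;

        /* fence is now published => signalling critical section starts */
        cookie = dma_fence_begin_signalling();

        /* ... push the job to hardware; no GFP_KERNEL allocations,
         *     no dma_resv_lock(), no userspace faults in here ...
         */

        dma_fence_end_signalling(cookie);
}

static void example_complete(struct example_job *job)
{
        bool cookie = dma_fence_begin_signalling();

        dma_fence_signal(job->fence);

        dma_fence_end_signalling(cookie);
}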
Re: [PATCH 03/18] dma-fence: basic lockdep annotations
On 6/4/20 10:12 AM, Daniel Vetter wrote: ... Thread A: mutex_lock(A); mutex_unlock(A); dma_fence_signal(); Thread B: mutex_lock(A); dma_fence_wait(); mutex_unlock(A); Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A. Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives. Just realized, isn't that example actually a true positive, or at least a great candidate for a true positive, since if another thread reenters that signaling path, it will block on that mutex, and the fence would never be signaled unless there is another signaling path? Although I agree the conclusion is sound: These annotations cannot be sprinkled mindlessly over the code. /Thomas v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise. v3: Kerneldoc. v4: Some spelling fixes from Mika Cc: Mika Kuoppala Cc: Thomas Hellstrom Cc: linux-me...@vger.kernel.org Cc: linaro-mm-...@lists.linaro.org Cc: linux-r...@vger.kernel.org Cc: amd-...@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: Chris Wilson Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter --- Documentation/driver-api/dma-buf.rst | 12 +- drivers/dma-buf/dma-fence.c | 161 +++ include/linux/dma-fence.h| 12 ++ 3 files changed, 182 insertions(+), 3 deletions(-) diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access -Fence Poll Support -~~ +Implicit Fence Poll Support +~~~ .. kernel-doc:: drivers/dma-buf/dma-buf.c - :doc: fence polling + :doc: implicit fence polling Kernel Functions and Structures Reference ~ @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview +DMA Fence Signalling Annotations + + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~ diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc); +/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. 
+ * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it
Re: [PATCH] dma-fence: basic lockdep annotations
does a dma_fence_signal, but is still critical for reaching completion of fences. One example would be a scheduler kthread which picks up jobs and pushes them into hardware, where the interrupt handler or another completion thread calls dma_fence_signal(). But if the scheduler thread hangs, then all the fences hang, hence we need to manually annotate it. cross-release aimed to solve this by chaining cross-release dependencies, but the dependency from scheduler thread to the completion interrupt handler goes through hw where cross-release code can't observe it. In short, without manual annotations and careful review of the start and end of critical sections, cross-release dependency tracking doesn't work. We need explicit annotations. v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise. v3: Kerneldoc. v4: Some spelling fixes from Mika v5: Amend commit message to explain in detail why cross-release isn't the solution. Cc: Mika Kuoppala Cc: Thomas Hellstrom Cc: linux-me...@vger.kernel.org Cc: linaro-mm-...@lists.linaro.org Cc: linux-r...@vger.kernel.org Cc: amd-...@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: Chris Wilson Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter --- Reviewed-by: Thomas Hellström
[PATCH v4 2/2] selftests/x86: Add a test to detect infinite sigtrap handler loop
When FRED is enabled, if the Trap Flag (TF) is set without an external debugger attached, it can lead to an infinite loop in the SIGTRAP handler. To avoid this, the software event flag in the augmented SS must be cleared, ensuring that no single-step trap remains pending when ERETU completes. This test checks for that specific scenario—verifying whether the kernel correctly prevents an infinite SIGTRAP loop in this edge case when FRED is enabled. The test should _always_ pass with IDT event delivery, thus no need to disable the test even when FRED is not enabled. Signed-off-by: Xin Li (Intel) --- Changes in this version: *) Address review comments from Sohil. --- tools/testing/selftests/x86/Makefile | 2 +- tools/testing/selftests/x86/sigtrap_loop.c | 97 ++ 2 files changed, 98 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile index f703fcfe9f7c..83148875a12c 100644 --- a/tools/testing/selftests/x86/Makefile +++ b/tools/testing/selftests/x86/Makefile @@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" trivial_program.c -no-pie) TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \ check_initial_reg_state sigreturn iopl ioperm \ - test_vsyscall mov_ss_trap \ + test_vsyscall mov_ss_trap sigtrap_loop \ syscall_arg_fault fsgsbase_restore sigaltstack TARGETS_C_BOTHBITS += nx_stack TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \ diff --git a/tools/testing/selftests/x86/sigtrap_loop.c b/tools/testing/selftests/x86/sigtrap_loop.c new file mode 100644 index ..dfb05769551d --- /dev/null +++ b/tools/testing/selftests/x86/sigtrap_loop.c @@ -0,0 +1,97 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Intel Corporation + */ +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include + +#ifdef __x86_64__ +# define REG_IP REG_RIP +#else +# define REG_IP REG_EIP +#endif + +static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int flags) +{ + struct sigaction sa; + + memset(&sa, 0, sizeof(sa)); + sa.sa_sigaction = handler; + sa.sa_flags = SA_SIGINFO | flags; + sigemptyset(&sa.sa_mask); + + if (sigaction(sig, &sa, 0)) + err(1, "sigaction"); + + return; +} + +static unsigned int loop_count_on_same_ip; + +static void sigtrap(int sig, siginfo_t *info, void *ctx_void) +{ + ucontext_t *ctx = (ucontext_t *)ctx_void; + static unsigned long last_trap_ip; + + if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) { + printf("\tTrapped at %016lx\n", last_trap_ip); + + /* +* If the same IP is hit more than 10 times in a row, it is +* _considered_ an infinite loop. +*/ + if (++loop_count_on_same_ip > 10) { + printf("[FAIL]\tDetected sigtrap infinite loop\n"); + exit(1); + } + + return; + } + + loop_count_on_same_ip = 0; + last_trap_ip = ctx->uc_mcontext.gregs[REG_IP]; + printf("\tTrapped at %016lx\n", last_trap_ip); +} + +int main(int argc, char *argv[]) +{ + sethandler(SIGTRAP, sigtrap, 0); + + /* +* Set the Trap Flag (TF) to single-step the test code, therefore to +* trigger a SIGTRAP signal after each instruction until the TF is +* cleared. +* +* Because the arithmetic flags are not significant here, the TF is +* set by pushing 0x302 onto the stack and then popping it into the +* flags register. +* +* Four instructions in the following asm code are executed with the +* TF set, thus the SIGTRAP handler is expected to run four times. 
+*/ + printf("[RUN]\tsigtrap infinite loop detection\n"); + asm volatile( +#ifdef __x86_64__ + /* Avoid clobbering the redzone */ + "sub $128, %rsp\n\t" +#endif + "push $0x302\n\t" + "popf\n\t" + "nop\n\t" + "nop\n\t" + "push $0x202\n\t" + "popf\n\t" +#ifdef __x86_64__ + "add $128, %rsp\n\t" +#endif + ); + + printf("[OK]\tNo sigtrap infinite loop detected\n"); + return 0; +} -- 2.49.0
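For readers decoding the pushed immediates, the relevant EFLAGS bits are listed below (values as in the x86 architecture manuals; the macro names follow the kernel's processor-flags definitions and are shown here only for reference):

/*
 * Why 0x302 and 0x202:
 *   0x002  reserved bit 1, must be written as 1
 *   0x100  TF, trap flag: enables single-stepping
 *   0x200  IF, interrupt enable flag
 * "push $0x302; popf" therefore sets TF (keeping IF set), and
 * "push $0x202; popf" clears TF again while leaving IF set.
 */
#define X86_EFLAGS_FIXED        0x00000002UL
#define X86_EFLAGS_TF           0x00000100UL
#define X86_EFLAGS_IF           0x00000200UL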
[PATCH v4 1/2] x86/fred/signal: Prevent single-step upon ERETU completion
Clear the software event flag in the augmented SS to prevent infinite SIGTRAP handler loop if the trap flag (TF) is set without an external debugger attached. Following is a typical single-stepping flow for a user process: 1) The user process is prepared for single-stepping by setting RFLAGS.TF = 1. 2) When any instruction in user space completes, a #DB is triggered. 3) The kernel handles the #DB and returns to user space, invoking the SIGTRAP handler with RFLAGS.TF = 0. 4) After the SIGTRAP handler finishes, the user process performs a sigreturn syscall, restoring the original state, including RFLAGS.TF = 1. 5) Goto step 2. According to the FRED specification: A) Bit 17 in the augmented SS is designated as the software event flag, which is set to 1 for FRED event delivery of SYSCALL, SYSENTER, or INT n. B) If bit 17 of the augmented SS is 1 and ERETU would result in RFLAGS.TF = 1, a single-step trap will be pending upon completion of ERETU. In step 4) above, the software event flag is set upon the sigreturn syscall, and its corresponding ERETU would restore RFLAGS.TF = 1. This combination causes a pending single-step trap upon completion of ERETU. Therefore, another #DB is triggered before any user space instruction is executed, which leads to an infinite loop in which the SIGTRAP handler keeps being invoked on the same user space IP. Suggested-by: H. Peter Anvin (Intel) Signed-off-by: Xin Li (Intel) Cc: sta...@vger.kernel.org --- Change in v3: *) Use "#ifdef CONFIG_X86_FRED" instead of IS_ENABLED(CONFIG_X86_FRED) (Intel LKP). Change in v2: *) Remove the check cpu_feature_enabled(X86_FEATURE_FRED), because regs->fred_ss.swevent will always be 0 otherwise (H. Peter Anvin). --- arch/x86/include/asm/sighandling.h | 22 ++ arch/x86/kernel/signal_32.c| 4 arch/x86/kernel/signal_64.c| 4 3 files changed, 30 insertions(+) diff --git a/arch/x86/include/asm/sighandling.h b/arch/x86/include/asm/sighandling.h index e770c4fc47f4..b8481d33ba8e 100644 --- a/arch/x86/include/asm/sighandling.h +++ b/arch/x86/include/asm/sighandling.h @@ -24,4 +24,26 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); +/* + * To prevent infinite SIGTRAP handler loop if the trap flag (TF) is set + * without an external debugger attached, clear the software event flag in + * the augmented SS, ensuring no single-step trap is pending upon ERETU + * completion. + * + * Note, this function should be called in sigreturn() before the original + * state is restored to make sure the TF is read from the entry frame. + */ +static __always_inline void prevent_single_step_upon_eretu(struct pt_regs *regs) +{ + /* +* If the trap flag (TF) is set, i.e., the sigreturn() SYSCALL instruction +* is being single-stepped, do not clear the software event flag in the +* augmented SS, thus a debugger won't skip over the following instruction. 
+*/ +#ifdef CONFIG_X86_FRED + if (!(regs->flags & X86_EFLAGS_TF)) + regs->fred_ss.swevent = 0; +#endif +} + #endif /* _ASM_X86_SIGHANDLING_H */ diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c index 98123ff10506..42bbc42bd350 100644 --- a/arch/x86/kernel/signal_32.c +++ b/arch/x86/kernel/signal_32.c @@ -152,6 +152,8 @@ SYSCALL32_DEFINE0(sigreturn) struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user *)(regs->sp-8); sigset_t set; + prevent_single_step_upon_eretu(regs); + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) @@ -175,6 +177,8 @@ SYSCALL32_DEFINE0(rt_sigreturn) struct rt_sigframe_ia32 __user *frame; sigset_t set; + prevent_single_step_upon_eretu(regs); + frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4); if (!access_ok(frame, sizeof(*frame))) diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c index ee9453891901..d483b585c6c6 100644 --- a/arch/x86/kernel/signal_64.c +++ b/arch/x86/kernel/signal_64.c @@ -250,6 +250,8 @@ SYSCALL_DEFINE0(rt_sigreturn) sigset_t set; unsigned long uc_flags; + prevent_single_step_upon_eretu(regs); + frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long)); if (!access_ok(frame, sizeof(*frame))) goto badframe; @@ -366,6 +368,8 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn) sigset_t set; unsigned long uc_flags; + prevent_single_step_upon_eretu(regs); + frame = (struct rt_sigframe_x32 __user *)(regs->sp - 8); if (!access_ok(frame, sizeof(*frame))) -- 2.49.0
[PATCH v4 0/2] x86/fred: Prevent single-step upon ERETU completion
IDT event delivery has a debug hole in which it does not generate #DB upon returning to userspace before the first userspace instruction is executed if the Trap Flag (TF) is set. FRED closes this hole by introducing a software event flag, i.e., bit 17 of the augmented SS: if the bit is set and ERETU would result in RFLAGS.TF = 1, a single-step trap will be pending upon completion of ERETU. However I overlooked properly setting and clearing the bit in different situations. Thus when FRED is enabled, if the Trap Flag (TF) is set without an external debugger attached, it can lead to an infinite loop in the SIGTRAP handler. To avoid this, the software event flag in the augmented SS must be cleared, ensuring that no single-step trap remains pending when ERETU completes. This patch set combines the fix [1] and its corresponding selftest [2] (requested by Dave Hansen) into one patch set. [1] https://lore.kernel.org/lkml/20250523050153.3308237-1-...@zytor.com/ [2] https://lore.kernel.org/lkml/20250530230707.2528916-1-...@zytor.com/ This patch set is based on tip/x86/urgent branch as of today. Xin Li (Intel) (2): x86/fred/signal: Prevent single-step upon ERETU completion selftests/x86: Add a test to detect infinite sigtrap handler loop arch/x86/include/asm/sighandling.h | 22 + arch/x86/kernel/signal_32.c| 4 + arch/x86/kernel/signal_64.c| 4 + tools/testing/selftests/x86/Makefile | 2 +- tools/testing/selftests/x86/sigtrap_loop.c | 97 ++ 5 files changed, 128 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c base-commit: dd2922dcfaa3296846265e113309e5f7f138839f -- 2.49.0
[PATCH v5 0/2] x86/fred: Prevent immediate repeat of single step trap on return from SIGTRAP handler
IDT event delivery has a debug hole in which it does not generate #DB upon returning to userspace before the first userspace instruction is executed if the Trap Flag (TF) is set. FRED closes this hole by introducing a software event flag, i.e., bit 17 of the augmented SS: if the bit is set and ERETU would result in RFLAGS.TF = 1, a single-step trap will be pending upon completion of ERETU. However I overlooked properly setting and clearing the bit in different situations. Thus when FRED is enabled, if the Trap Flag (TF) is set without an external debugger attached, it can lead to an infinite loop in the SIGTRAP handler. To avoid this, the software event flag in the augmented SS must be cleared, ensuring that no single-step trap remains pending when ERETU completes. This patch set combines the fix [1] and its corresponding selftest [2] (requested by Dave Hansen) into one patch set. [1] https://lore.kernel.org/lkml/20250523050153.3308237-1-...@zytor.com/ [2] https://lore.kernel.org/lkml/20250530230707.2528916-1-...@zytor.com/ This patch set is based on tip/x86/urgent branch as of today. Link to v4 of this patch set: https://lore.kernel.org/lkml/20250605181020.590459-1-...@zytor.com/ Changes in v5: *) Accurately rephrase the shortlog (hpa). *) Do "sub $-128, %rsp" rather than "add $128, %rsp", which is more efficient in code size (hpa). *) Add TB from Sohil. *) Add Cc: sta...@vger.kernel.org to all patches. Xin Li (Intel) (2): x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler selftests/x86: Add a test to detect infinite sigtrap handler loop arch/x86/include/asm/sighandling.h | 22 + arch/x86/kernel/signal_32.c| 4 + arch/x86/kernel/signal_64.c| 4 + tools/testing/selftests/x86/Makefile | 2 +- tools/testing/selftests/x86/sigtrap_loop.c | 98 ++ 5 files changed, 129 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c base-commit: dd2922dcfaa3296846265e113309e5f7f138839f -- 2.49.0
[PATCH v5 1/2] x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler
Clear the software event flag in the augmented SS to prevent immediate repeat of single step trap on return from SIGTRAP handler if the trap flag (TF) is set without an external debugger attached. Following is a typical single-stepping flow for a user process: 1) The user process is prepared for single-stepping by setting RFLAGS.TF = 1. 2) When any instruction in user space completes, a #DB is triggered. 3) The kernel handles the #DB and returns to user space, invoking the SIGTRAP handler with RFLAGS.TF = 0. 4) After the SIGTRAP handler finishes, the user process performs a sigreturn syscall, restoring the original state, including RFLAGS.TF = 1. 5) Goto step 2. According to the FRED specification: A) Bit 17 in the augmented SS is designated as the software event flag, which is set to 1 for FRED event delivery of SYSCALL, SYSENTER, or INT n. B) If bit 17 of the augmented SS is 1 and ERETU would result in RFLAGS.TF = 1, a single-step trap will be pending upon completion of ERETU. In step 4) above, the software event flag is set upon the sigreturn syscall, and its corresponding ERETU would restore RFLAGS.TF = 1. This combination causes a pending single-step trap upon completion of ERETU. Therefore, another #DB is triggered before any user space instruction is executed, which leads to an infinite loop in which the SIGTRAP handler keeps being invoked on the same user space IP. Suggested-by: H. Peter Anvin (Intel) Signed-off-by: Xin Li (Intel) Cc: sta...@vger.kernel.org --- Change in v5: *) Accurately rephrase the shortlog (hpa). Change in v4: *) Add a selftest to the patch set (Dave Hansen). Change in v3: *) Use "#ifdef CONFIG_X86_FRED" instead of IS_ENABLED(CONFIG_X86_FRED) (Intel LKP). Change in v2: *) Remove the check cpu_feature_enabled(X86_FEATURE_FRED), because regs->fred_ss.swevent will always be 0 otherwise (hpa). --- arch/x86/include/asm/sighandling.h | 22 ++ arch/x86/kernel/signal_32.c| 4 arch/x86/kernel/signal_64.c| 4 3 files changed, 30 insertions(+) diff --git a/arch/x86/include/asm/sighandling.h b/arch/x86/include/asm/sighandling.h index e770c4fc47f4..8727c7e21dd1 100644 --- a/arch/x86/include/asm/sighandling.h +++ b/arch/x86/include/asm/sighandling.h @@ -24,4 +24,26 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); +/* + * To prevent immediate repeat of single step trap on return from SIGTRAP + * handler if the trap flag (TF) is set without an external debugger attached, + * clear the software event flag in the augmented SS, ensuring no single-step + * trap is pending upon ERETU completion. + * + * Note, this function should be called in sigreturn() before the original + * state is restored to make sure the TF is read from the entry frame. + */ +static __always_inline void prevent_single_step_upon_eretu(struct pt_regs *regs) +{ + /* +* If the trap flag (TF) is set, i.e., the sigreturn() SYSCALL instruction +* is being single-stepped, do not clear the software event flag in the +* augmented SS, thus a debugger won't skip over the following instruction. 
+*/ +#ifdef CONFIG_X86_FRED + if (!(regs->flags & X86_EFLAGS_TF)) + regs->fred_ss.swevent = 0; +#endif +} + #endif /* _ASM_X86_SIGHANDLING_H */ diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c index 98123ff10506..42bbc42bd350 100644 --- a/arch/x86/kernel/signal_32.c +++ b/arch/x86/kernel/signal_32.c @@ -152,6 +152,8 @@ SYSCALL32_DEFINE0(sigreturn) struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user *)(regs->sp-8); sigset_t set; + prevent_single_step_upon_eretu(regs); + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) @@ -175,6 +177,8 @@ SYSCALL32_DEFINE0(rt_sigreturn) struct rt_sigframe_ia32 __user *frame; sigset_t set; + prevent_single_step_upon_eretu(regs); + frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4); if (!access_ok(frame, sizeof(*frame))) diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c index ee9453891901..d483b585c6c6 100644 --- a/arch/x86/kernel/signal_64.c +++ b/arch/x86/kernel/signal_64.c @@ -250,6 +250,8 @@ SYSCALL_DEFINE0(rt_sigreturn) sigset_t set; unsigned long uc_flags; + prevent_single_step_upon_eretu(regs); + frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long)); if (!access_ok(frame, sizeof(*frame))) goto badframe; @@ -366,6 +368,8 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn) sigset_t set; unsigned long uc_flags; + prevent_single_step_upon_eretu(regs); + fr
[PATCH v5 2/2] selftests/x86: Add a test to detect infinite sigtrap handler loop
When FRED is enabled, if the Trap Flag (TF) is set without an external debugger attached, it can lead to an infinite loop in the SIGTRAP handler. To avoid this, the software event flag in the augmented SS must be cleared, ensuring that no single-step trap remains pending when ERETU completes. This test checks for that specific scenario—verifying whether the kernel correctly prevents an infinite SIGTRAP loop in this edge case when FRED is enabled. The test should _always_ pass with IDT event delivery, thus no need to disable the test even when FRED is not enabled. Signed-off-by: Xin Li (Intel) Tested-by: Sohil Mehta Cc: sta...@vger.kernel.org --- Changes in v5: *) Do "sub $-128, %rsp" rather than "add $128, %rsp", which is more efficient in code size (hpa). *) Add TB from Sohil. *) Add Cc: sta...@vger.kernel.org. Changes in v4: *) merge this selftest with its bug fix patch set (Dave Hansen). *) Address review comments from Sohil. --- tools/testing/selftests/x86/Makefile | 2 +- tools/testing/selftests/x86/sigtrap_loop.c | 98 ++ 2 files changed, 99 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile index f703fcfe9f7c..83148875a12c 100644 --- a/tools/testing/selftests/x86/Makefile +++ b/tools/testing/selftests/x86/Makefile @@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" trivial_program.c -no-pie) TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \ check_initial_reg_state sigreturn iopl ioperm \ - test_vsyscall mov_ss_trap \ + test_vsyscall mov_ss_trap sigtrap_loop \ syscall_arg_fault fsgsbase_restore sigaltstack TARGETS_C_BOTHBITS += nx_stack TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \ diff --git a/tools/testing/selftests/x86/sigtrap_loop.c b/tools/testing/selftests/x86/sigtrap_loop.c new file mode 100644 index ..9eecf32c79c2 --- /dev/null +++ b/tools/testing/selftests/x86/sigtrap_loop.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Intel Corporation + */ +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include + +#ifdef __x86_64__ +# define REG_IP REG_RIP +#else +# define REG_IP REG_EIP +#endif + +static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int flags) +{ + struct sigaction sa; + + memset(&sa, 0, sizeof(sa)); + sa.sa_sigaction = handler; + sa.sa_flags = SA_SIGINFO | flags; + sigemptyset(&sa.sa_mask); + + if (sigaction(sig, &sa, 0)) + err(1, "sigaction"); + + return; +} + +static unsigned int loop_count_on_same_ip; + +static void sigtrap(int sig, siginfo_t *info, void *ctx_void) +{ + ucontext_t *ctx = (ucontext_t *)ctx_void; + static unsigned long last_trap_ip; + + if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) { + printf("\tTrapped at %016lx\n", last_trap_ip); + + /* +* If the same IP is hit more than 10 times in a row, it is +* _considered_ an infinite loop. +*/ + if (++loop_count_on_same_ip > 10) { + printf("[FAIL]\tDetected sigtrap infinite loop\n"); + exit(1); + } + + return; + } + + loop_count_on_same_ip = 0; + last_trap_ip = ctx->uc_mcontext.gregs[REG_IP]; + printf("\tTrapped at %016lx\n", last_trap_ip); +} + +int main(int argc, char *argv[]) +{ + sethandler(SIGTRAP, sigtrap, 0); + + /* +* Set the Trap Flag (TF) to single-step the test code, therefore to +* trigger a SIGTRAP signal after each instruction until the TF is +* cleared. 
+* +* Because the arithmetic flags are not significant here, the TF is +* set by pushing 0x302 onto the stack and then popping it into the +* flags register. +* +* Four instructions in the following asm code are executed with the +* TF set, thus the SIGTRAP handler is expected to run four times. +*/ + printf("[RUN]\tsigtrap infinite loop detection\n"); + asm volatile( +#ifdef __x86_64__ + /* Avoid clobbering the redzone */ + "sub $128, %rsp\n\t" +#endif + "push $0x302\n\t" + "popf\n\t" + "nop\n\t" + "nop\n\t" + "push $0x202\n\t" + "popf\n\t" +#ifdef __x86_64__ + /* Equivalent to "add $128, %rsp" with 3 fewer bytes in encoding */ + "sub $-128, %rsp\n\t" +#endif + ); + + printf("[OK]\tNo sigtrap infinite loop detected\n"); + return 0; +} -- 2.49.0
[PATCH v6 1/2] x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler
Clear the software event flag in the augmented SS to prevent immediate repeat of single step trap on return from SIGTRAP handler if the trap flag (TF) is set without an external debugger attached. Following is a typical single-stepping flow for a user process: 1) The user process is prepared for single-stepping by setting RFLAGS.TF = 1. 2) When any instruction in user space completes, a #DB is triggered. 3) The kernel handles the #DB and returns to user space, invoking the SIGTRAP handler with RFLAGS.TF = 0. 4) After the SIGTRAP handler finishes, the user process performs a sigreturn syscall, restoring the original state, including RFLAGS.TF = 1. 5) Goto step 2. According to the FRED specification: A) Bit 17 in the augmented SS is designated as the software event flag, which is set to 1 for FRED event delivery of SYSCALL, SYSENTER, or INT n. B) If bit 17 of the augmented SS is 1 and ERETU would result in RFLAGS.TF = 1, a single-step trap will be pending upon completion of ERETU. In step 4) above, the software event flag is set upon the sigreturn syscall, and its corresponding ERETU would restore RFLAGS.TF = 1. This combination causes a pending single-step trap upon completion of ERETU. Therefore, another #DB is triggered before any user space instruction is executed, which leads to an infinite loop in which the SIGTRAP handler keeps being invoked on the same user space IP. Suggested-by: H. Peter Anvin (Intel) Tested-by: Sohil Mehta Signed-off-by: Xin Li (Intel) Cc: sta...@vger.kernel.org --- Change in v6: *) Add TB from Sohil. Change in v5: *) Accurately rephrase the shortlog (hpa). Change in v4: *) Add a selftest to the patch set (Dave Hansen). Change in v3: *) Use "#ifdef CONFIG_X86_FRED" instead of IS_ENABLED(CONFIG_X86_FRED) (Intel LKP). Change in v2: *) Remove the check cpu_feature_enabled(X86_FEATURE_FRED), because regs->fred_ss.swevent will always be 0 otherwise (hpa). --- arch/x86/include/asm/sighandling.h | 22 ++ arch/x86/kernel/signal_32.c| 4 arch/x86/kernel/signal_64.c| 4 3 files changed, 30 insertions(+) diff --git a/arch/x86/include/asm/sighandling.h b/arch/x86/include/asm/sighandling.h index e770c4fc47f4..8727c7e21dd1 100644 --- a/arch/x86/include/asm/sighandling.h +++ b/arch/x86/include/asm/sighandling.h @@ -24,4 +24,26 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs); +/* + * To prevent immediate repeat of single step trap on return from SIGTRAP + * handler if the trap flag (TF) is set without an external debugger attached, + * clear the software event flag in the augmented SS, ensuring no single-step + * trap is pending upon ERETU completion. + * + * Note, this function should be called in sigreturn() before the original + * state is restored to make sure the TF is read from the entry frame. + */ +static __always_inline void prevent_single_step_upon_eretu(struct pt_regs *regs) +{ + /* +* If the trap flag (TF) is set, i.e., the sigreturn() SYSCALL instruction +* is being single-stepped, do not clear the software event flag in the +* augmented SS, thus a debugger won't skip over the following instruction. 
+*/ +#ifdef CONFIG_X86_FRED + if (!(regs->flags & X86_EFLAGS_TF)) + regs->fred_ss.swevent = 0; +#endif +} + #endif /* _ASM_X86_SIGHANDLING_H */ diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c index 98123ff10506..42bbc42bd350 100644 --- a/arch/x86/kernel/signal_32.c +++ b/arch/x86/kernel/signal_32.c @@ -152,6 +152,8 @@ SYSCALL32_DEFINE0(sigreturn) struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user *)(regs->sp-8); sigset_t set; + prevent_single_step_upon_eretu(regs); + if (!access_ok(frame, sizeof(*frame))) goto badframe; if (__get_user(set.sig[0], &frame->sc.oldmask) @@ -175,6 +177,8 @@ SYSCALL32_DEFINE0(rt_sigreturn) struct rt_sigframe_ia32 __user *frame; sigset_t set; + prevent_single_step_upon_eretu(regs); + frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4); if (!access_ok(frame, sizeof(*frame))) diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c index ee9453891901..d483b585c6c6 100644 --- a/arch/x86/kernel/signal_64.c +++ b/arch/x86/kernel/signal_64.c @@ -250,6 +250,8 @@ SYSCALL_DEFINE0(rt_sigreturn) sigset_t set; unsigned long uc_flags; + prevent_single_step_upon_eretu(regs); + frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long)); if (!access_ok(frame, sizeof(*frame))) goto badframe; @@ -366,6 +368,8 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn) sigset_t set; unsigned long uc_flags;
[PATCH v6 2/2] selftests/x86: Add a test to detect infinite SIGTRAP handler loop
When FRED is enabled, if the Trap Flag (TF) is set without an external debugger attached, it can lead to an infinite loop in the SIGTRAP handler. To avoid this, the software event flag in the augmented SS must be cleared, ensuring that no single-step trap remains pending when ERETU completes. This test checks for that specific scenario—verifying whether the kernel correctly prevents an infinite SIGTRAP loop in this edge case when FRED is enabled. The test should _always_ pass with IDT event delivery, thus no need to disable the test even when FRED is not enabled. Tested-by: Sohil Mehta Signed-off-by: Xin Li (Intel) Cc: sta...@vger.kernel.org --- Changes in v6: *) Replace a "sub $128, %rsp" with "add $-128, %rsp" (hpa). *) Declared loop_count_on_same_ip inside sigtrap() (Sohil). *) s/sigtrap/SIGTRAP (Sohil). Changes in v5: *) Do "sub $-128, %rsp" rather than "add $128, %rsp", which is more efficient in code size (hpa). *) Add TB from Sohil. *) Add Cc: sta...@vger.kernel.org. Changes in v4: *) merge this selftest with its bug fix patch set (Dave Hansen). *) Address review comments from Sohil. --- tools/testing/selftests/x86/Makefile | 2 +- tools/testing/selftests/x86/sigtrap_loop.c | 101 + 2 files changed, 102 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile index f703fcfe9f7c..83148875a12c 100644 --- a/tools/testing/selftests/x86/Makefile +++ b/tools/testing/selftests/x86/Makefile @@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" trivial_program.c -no-pie) TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \ check_initial_reg_state sigreturn iopl ioperm \ - test_vsyscall mov_ss_trap \ + test_vsyscall mov_ss_trap sigtrap_loop \ syscall_arg_fault fsgsbase_restore sigaltstack TARGETS_C_BOTHBITS += nx_stack TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \ diff --git a/tools/testing/selftests/x86/sigtrap_loop.c b/tools/testing/selftests/x86/sigtrap_loop.c new file mode 100644 index ..9d065479e89f --- /dev/null +++ b/tools/testing/selftests/x86/sigtrap_loop.c @@ -0,0 +1,101 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Intel Corporation + */ +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include + +#ifdef __x86_64__ +# define REG_IP REG_RIP +#else +# define REG_IP REG_EIP +#endif + +static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int flags) +{ + struct sigaction sa; + + memset(&sa, 0, sizeof(sa)); + sa.sa_sigaction = handler; + sa.sa_flags = SA_SIGINFO | flags; + sigemptyset(&sa.sa_mask); + + if (sigaction(sig, &sa, 0)) + err(1, "sigaction"); + + return; +} + +static void sigtrap(int sig, siginfo_t *info, void *ctx_void) +{ + ucontext_t *ctx = (ucontext_t *)ctx_void; + static unsigned int loop_count_on_same_ip; + static unsigned long last_trap_ip; + + if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) { + printf("\tTrapped at %016lx\n", last_trap_ip); + + /* +* If the same IP is hit more than 10 times in a row, it is +* _considered_ an infinite loop. 
+*/ + if (++loop_count_on_same_ip > 10) { + printf("[FAIL]\tDetected SIGTRAP infinite loop\n"); + exit(1); + } + + return; + } + + loop_count_on_same_ip = 0; + last_trap_ip = ctx->uc_mcontext.gregs[REG_IP]; + printf("\tTrapped at %016lx\n", last_trap_ip); +} + +int main(int argc, char *argv[]) +{ + sethandler(SIGTRAP, sigtrap, 0); + + /* +* Set the Trap Flag (TF) to single-step the test code, therefore to +* trigger a SIGTRAP signal after each instruction until the TF is +* cleared. +* +* Because the arithmetic flags are not significant here, the TF is +* set by pushing 0x302 onto the stack and then popping it into the +* flags register. +* +* Four instructions in the following asm code are executed with the +* TF set, thus the SIGTRAP handler is expected to run four times. +*/ + printf("[RUN]\tSIGTRAP infinite loop detection\n"); + asm volatile( +#ifdef __x86_64__ + /* +* Avoid clobbering the redzone +* +* Equivalent to "sub $128, %rsp", however -128 can be encoded +* in a single byte immediate while 128 uses 4 bytes. +*/
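The v6 hunk above is cut off before the inline assembly. Based on the v1 version of this test reproduced further down and the v5/v6 changelog notes about the stack adjustment, the sequence presumably looks like the sketch below (an illustrative reconstruction, not the verbatim v6 hunk):

	asm volatile(
#ifdef __x86_64__
		/*
		 * Step over the 128-byte red zone before pushing; -128 fits in a
		 * one-byte immediate, which is why the changelog prefers
		 * "add $-128" / "sub $-128" over the 4-byte encodings of 128.
		 */
		"add $-128, %rsp\n\t"
#endif
		"push $0x302\n\t"	/* reserved bit 1 | TF (0x100) | IF (0x200) */
		"popf\n\t"		/* TF is now set */
		"nop\n\t"
		"nop\n\t"
		"push $0x202\n\t"	/* reserved bit 1 | IF, TF cleared */
		"popf\n\t"
#ifdef __x86_64__
		"sub $-128, %rsp\n\t"
#endif
		);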
[PATCH v6 0/2] x86/fred: Prevent immediate repeat of single step trap on return from SIGTRAP handler
IDT event delivery has a debug hole in which it does not generate #DB upon returning to userspace before the first userspace instruction is executed if the Trap Flag (TF) is set. FRED closes this hole by introducing a software event flag, i.e., bit 17 of the augmented SS: if the bit is set and ERETU would result in RFLAGS.TF = 1, a single-step trap will be pending upon completion of ERETU. However I overlooked properly setting and clearing the bit in different situations. Thus when FRED is enabled, if the Trap Flag (TF) is set without an external debugger attached, it can lead to an infinite loop in the SIGTRAP handler. To avoid this, the software event flag in the augmented SS must be cleared, ensuring that no single-step trap remains pending when ERETU completes. This patch set combines the fix [1] and its corresponding selftest [2] (requested by Dave Hansen) into one patch set. [1] https://lore.kernel.org/lkml/20250523050153.3308237-1-...@zytor.com/ [2] https://lore.kernel.org/lkml/20250530230707.2528916-1-...@zytor.com/ This patch set is based on tip/x86/urgent branch. Link to v5 of this patch set: https://lore.kernel.org/lkml/20250606174528.1004756-1-...@zytor.com/ Changes in v6: *) Replace a "sub $128, %rsp" with "add $-128, %rsp" (hpa). *) Declared loop_count_on_same_ip inside sigtrap() (Sohil). *) s/sigtrap/SIGTRAP (Sohil). *) Add TB from Sohil to the first patch. Xin Li (Intel) (2): x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler selftests/x86: Add a test to detect infinite SIGTRAP handler loop arch/x86/include/asm/sighandling.h | 22 + arch/x86/kernel/signal_32.c| 4 + arch/x86/kernel/signal_64.c| 4 + tools/testing/selftests/x86/Makefile | 2 +- tools/testing/selftests/x86/sigtrap_loop.c | 101 + 5 files changed, 132 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c base-commit: dd2922dcfaa3296846265e113309e5f7f138839f -- 2.49.0
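For readers not steeped in FRED, the rule the fix relies on can be modelled in a few lines of C. This is a deliberately simplified illustration of the behaviour described above, not the architectural definition, and the struct below is invented purely for the sketch:

/*
 * Simplified model: the "software event" flag is bit 17 of the augmented
 * SS saved in the FRED return frame.  If it is set and the RFLAGS image
 * being restored has TF = 1, ERETU completes with a single-step #DB
 * pending before the first user instruction runs.
 */
struct eretu_frame_model {
	unsigned int ss_swevent : 1;	/* augmented SS bit 17 */
	unsigned int rflags_tf  : 1;	/* TF in the restored RFLAGS image */
};

static inline int db_pending_after_eretu(const struct eretu_frame_model *f)
{
	return f->ss_swevent && f->rflags_tf;
}

The fix in patch 1/2 clears this flag on the sigreturn paths, unless the sigreturn itself is being single-stepped, so that the trap which was just handled is not immediately re-delivered at the same IP.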
[PATCH v1 1/1] selftests/x86: Add a test to detect infinite sigtrap handler loop
When FRED is enabled, if the Trap Flag (TF) is set without an external debugger attached, it can lead to an infinite loop in the SIGTRAP handler. To avoid this, the software event flag in the augmented SS must be cleared, ensuring that no single-step trap remains pending when ERETU completes. This test checks for that specific scenario—verifying whether the kernel correctly prevents an infinite SIGTRAP loop in this edge case. Signed-off-by: Xin Li (Intel) --- tools/testing/selftests/x86/Makefile | 2 +- .../selftests/x86/test_sigtrap_handler.c | 80 +++ 2 files changed, 81 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/x86/test_sigtrap_handler.c diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile index f703fcfe9f7c..c486fd88ebb1 100644 --- a/tools/testing/selftests/x86/Makefile +++ b/tools/testing/selftests/x86/Makefile @@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" trivial_program.c -no-pie) TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \ check_initial_reg_state sigreturn iopl ioperm \ - test_vsyscall mov_ss_trap \ + test_vsyscall mov_ss_trap test_sigtrap_handler \ syscall_arg_fault fsgsbase_restore sigaltstack TARGETS_C_BOTHBITS += nx_stack TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \ diff --git a/tools/testing/selftests/x86/test_sigtrap_handler.c b/tools/testing/selftests/x86/test_sigtrap_handler.c new file mode 100644 index ..9c5c2cf0cf88 --- /dev/null +++ b/tools/testing/selftests/x86/test_sigtrap_handler.c @@ -0,0 +1,80 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2025 Intel Corporation + */ +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include + +#ifdef __x86_64__ +# define REG_IP REG_RIP +#else +# define REG_IP REG_EIP +#endif + +static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int flags) +{ + struct sigaction sa; + + memset(&sa, 0, sizeof(sa)); + sa.sa_sigaction = handler; + sa.sa_flags = SA_SIGINFO | flags; + sigemptyset(&sa.sa_mask); + + if (sigaction(sig, &sa, 0)) + err(1, "sigaction"); + + return; +} + +static unsigned int loop_count_on_same_ip; + +static void sigtrap(int sig, siginfo_t *info, void *ctx_void) +{ + ucontext_t *ctx = (ucontext_t *)ctx_void; + static unsigned long last_trap_ip; + + if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) { + printf("trapped on %016lx\n", last_trap_ip); + + if (++loop_count_on_same_ip > 10) { + printf("trap loop detected, test failed\n"); + exit(2); + } + + return; + } + + loop_count_on_same_ip = 0; + last_trap_ip = ctx->uc_mcontext.gregs[REG_IP]; + printf("trapped on %016lx\n", last_trap_ip); +} + +int main(int argc, char *argv[]) +{ + sethandler(SIGTRAP, sigtrap, 0); + + asm volatile( +#ifdef __x86_64__ + /* Avoid clobbering the redzone */ + "sub $128, %rsp\n\t" +#endif + "push $0x302\n\t" + "popf\n\t" + "nop\n\t" + "nop\n\t" + "push $0x202\n\t" + "popf\n\t" +#ifdef __x86_64__ + "add $128, %rsp\n\t" +#endif + ); + + printf("test passed\n"); + return 0; +} base-commit: 485d11d84a2452ac16466cc7ae041c93d38929bc -- 2.49.0
[PATCH v3 6/7] x86/tls,ptrace: provide regset access to the GDT
From: "H. Peter Anvin" Provide access to the user-visible part of the GDT via a regset in ptrace(). Note that we already provide a regset for the TLS area part of the GDT; these can trivially be unified by looking at the contents of the regset structure, especially since the TLS area is the only user-modifiable part of the GDT. Signed-off-by: H. Peter Anvin (Intel) Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger --- arch/x86/kernel/ptrace.c | 83 ++-- arch/x86/kernel/tls.c| 83 ++-- arch/x86/kernel/tls.h| 8 +++-- include/uapi/linux/elf.h | 1 + 4 files changed, 118 insertions(+), 57 deletions(-) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index e2ee403865eb..5ce10310f440 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -50,6 +50,7 @@ enum x86_regset { REGSET_XSTATE, REGSET_TLS, REGSET_IOPERM32, + REGSET_GDT }; struct pt_regs_offset { @@ -747,6 +748,60 @@ static int ioperm_get(struct task_struct *target, 0, IO_BITMAP_BYTES); } +/* + * These provide read access to the GDT. As the only part that is + * writable is the TLS area, that code is in tls.c. + */ +static int gdt_get(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int count, + void *kbuf, void __user *ubuf) +{ + struct desc_struct gdt_copy[GDT_LAST_USER + 1]; + const struct desc_struct *p; + struct user_desc udesc; + unsigned int index, endindex; + int err; + + if (pos % sizeof(struct user_desc)) + return -EINVAL; + + /* Get a snapshot of the GDT from an arbitrary CPU */ + memcpy(gdt_copy, get_current_gdt_ro(), sizeof(gdt_copy)); + + /* Copy over the TLS area */ + memcpy(&gdt_copy[GDT_ENTRY_TLS_MIN], target->thread.tls_array, + sizeof(target->thread.tls_array)); + + /* Descriptor zero is never accessible */ + memset(&gdt_copy[0], 0, sizeof(gdt_copy[0])); + + index = pos/sizeof(struct user_desc); + endindex = index + count/sizeof(struct user_desc); + endindex = min_t(unsigned int, GDT_LAST_USER + 1 - regset->bias, + endindex); + + p = &gdt_copy[index + regset->bias]; + + while (count && index < endindex) { + fill_user_desc(&udesc, index++, p++); + err = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &udesc, +pos, pos + sizeof(udesc)); + if (err) + return err; + } + + /* Return zero for the rest of the regset, if applicable. */ + return user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf, 0, -1); +} + +static int gdt_active(struct task_struct *target, + const struct user_regset *regset) +{ + (void)target; + return GDT_LAST_USER + 1; +} + /* * Called by kernel/ptrace.c when detaching.. 
* @@ -1262,7 +1317,8 @@ static struct user_regset x86_64_regsets[] __ro_after_init = { .core_note_type = NT_PRFPREG, .n = sizeof(struct user_i387_struct) / sizeof(long), .size = sizeof(long), .align = sizeof(long), - .active = regset_xregset_fpregs_active, .get = xfpregs_get, .set = xfpregs_set + .active = regset_xregset_fpregs_active, .get = xfpregs_get, + .set = xfpregs_set }, [REGSET_XSTATE] = { .core_note_type = NT_X86_XSTATE, @@ -1276,6 +1332,14 @@ static struct user_regset x86_64_regsets[] __ro_after_init = { .size = sizeof(long), .align = sizeof(long), .active = ioperm_active, .get = ioperm_get }, + [REGSET_GDT] = { + .core_note_type = NT_X86_GDT, + .n = LDT_ENTRIES, /* Theoretical maximum */ + .size = sizeof(struct user_desc), + .align = sizeof(struct user_desc), + .active = gdt_active, .get = gdt_get, + .set = regset_gdt_set + }, }; static const struct user_regset_view user_x86_64_view = { @@ -1309,7 +1373,8 @@ static struct user_regset x86_32_regsets[] __ro_after_init = { .core_note_type = NT_PRXFPREG, .n = sizeof(struct user32_fxsr_struct) / sizeof(u32), .size = sizeof(u32), .align = sizeof(u32), - .active = regset_xregset_fpregs_active, .get = xfpregs_get, .set = xfpregs_set + .active = regset_xregset_fpregs_active, .get = xfpregs_get, + .set = xfpregs_set }, [REGSET_XSTATE] = { .core_note_type = NT_X86_XSTA
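To illustrate how a debugger would consume the new regset, here is a rough userspace sketch. dump_gdt() is a hypothetical helper, and the NT_X86_GDT value used below is a guarded placeholder: the real number is assigned in the elf.h hunk of this series, which is not shown here.

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <asm/ldt.h>		/* struct user_desc */
#include <elf.h>

#ifndef NT_X86_GDT
# define NT_X86_GDT 0x203	/* placeholder; see the elf.h change in this series */
#endif

static void dump_gdt(pid_t pid)
{
	struct user_desc gdt[32];	/* more than enough for the user-visible slots */
	struct iovec iov = { .iov_base = gdt, .iov_len = sizeof(gdt) };

	if (ptrace(PTRACE_GETREGSET, pid, NT_X86_GDT, &iov) == 0) {
		for (unsigned int i = 0; i < iov.iov_len / sizeof(gdt[0]); i++)
			printf("GDT[%u]: base=%#x limit=%#x 32bit=%u\n",
			       i, gdt[i].base_addr, gdt[i].limit, gdt[i].seg_32bit);
	}
}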
[PATCH v3 4/7] x86/tls: create an explicit config symbol for the TLS area in the GDT
From: "H. Peter Anvin" Instead of using X86_32 || IA32_EMULATION, which is really rather ugly in the Makefile especially, create a dedicated config symbol for the TLS area. This will be further used in subsequent patches. Signed-off-by: H. Peter Anvin (Intel) Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger --- arch/x86/Kconfig | 4 arch/x86/kernel/Makefile | 3 +-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f1dbb4ee19d7..e48036e3f158 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2937,6 +2937,10 @@ config X86_DMA_REMAP config HAVE_GENERIC_GUP def_bool y +config X86_TLS_AREA + def_bool y + depends on IA32_EMULATION || X86_32 + source "net/Kconfig" source "drivers/Kconfig" diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 02d6f5cf4e70..d26ef8a40a88 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -65,8 +65,7 @@ obj-y += resource.o obj-y += process.o obj-y += fpu/ obj-y += ptrace.o -obj-$(CONFIG_X86_32) += tls.o -obj-$(CONFIG_IA32_EMULATION) += tls.o +obj-$(CONFIG_X86_TLS_AREA) += tls.o obj-y += step.o obj-$(CONFIG_INTEL_TXT)+= tboot.o obj-$(CONFIG_ISA_DMA_API) += i8237.o -- 2.14.4
[PATCH v3 5/7] x86/segment: add #define for the last user-visible GDT slot
From: "H. Peter Anvin" We don't want to advertise to user space how many slots the kernel GDT has, but user space can trivially find out what the last user-accessible GDT slot is. Add a #define for that so we can use that in sizing a regset. Signed-off-by: H. Peter Anvin (Intel) Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger --- arch/x86/include/asm/segment.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h index e293c122d0d5..5eb809eec048 100644 --- a/arch/x86/include/asm/segment.h +++ b/arch/x86/include/asm/segment.h @@ -115,6 +115,11 @@ */ #define GDT_ENTRIES32 +/* + * Last user-visible GDT slot + */ +#define GDT_LAST_USER GDT_ENTRY_DEFAULT_USER_DS + /* * Segment selector values corresponding to the above entries: */ @@ -194,6 +199,11 @@ */ #define GDT_ENTRIES16 +/* + * Last user-visible GDT slot + */ +#define GDT_LAST_USER GDT_ENTRY_PER_CPU + /* * Segment selector values corresponding to the above entries: * -- 2.14.4
[PATCH v3 3/7] x86: move fill_user_desc() from tls.c to desc.h and add validity check
From: "H. Peter Anvin" This is generic code which is potentially useful in other contexts. Unfortunately modify_ldt() is kind of stupid in that it returns a descriptor in CPU format but takes a different format, but regsets *have* to operate differently. Signed-off-by: H. Peter Anvin (Intel) Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger --- arch/x86/include/asm/desc.h | 22 ++ arch/x86/kernel/tls.c | 19 --- 2 files changed, 22 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index 13c5ee878a47..5e69f993c9ff 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -386,6 +386,28 @@ static inline void set_desc_limit(struct desc_struct *desc, unsigned long limit) desc->limit1 = (limit >> 16) & 0xf; } + +static inline void fill_user_desc(struct user_desc *info, int idx, + const struct desc_struct *desc) + +{ + memset(info, 0, sizeof(*info)); + info->entry_number = idx; + if (desc->s && desc->dpl == 3) { + info->base_addr = get_desc_base(desc); + info->limit = get_desc_limit(desc); + info->seg_32bit = desc->d; + info->contents = desc->type >> 2; + info->read_exec_only = !(desc->type & 2); + info->limit_in_pages = desc->g; + info->seg_not_present = !desc->p; + info->useable = desc->avl; +#ifdef CONFIG_X86_64 + info->lm = desc->l; +#endif + } +} + void update_intr_gate(unsigned int n, const void *addr); void alloc_intr_gate(unsigned int n, const void *addr); diff --git a/arch/x86/kernel/tls.c b/arch/x86/kernel/tls.c index a5b802a12212..7b8ecb760707 100644 --- a/arch/x86/kernel/tls.c +++ b/arch/x86/kernel/tls.c @@ -197,25 +197,6 @@ SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, u_info) * Get the current Thread-Local Storage area: */ -static void fill_user_desc(struct user_desc *info, int idx, - const struct desc_struct *desc) - -{ - memset(info, 0, sizeof(*info)); - info->entry_number = idx; - info->base_addr = get_desc_base(desc); - info->limit = get_desc_limit(desc); - info->seg_32bit = desc->d; - info->contents = desc->type >> 2; - info->read_exec_only = !(desc->type & 2); - info->limit_in_pages = desc->g; - info->seg_not_present = !desc->p; - info->useable = desc->avl; -#ifdef CONFIG_X86_64 - info->lm = desc->l; -#endif -} - int do_get_thread_area(struct task_struct *p, int idx, struct user_desc __user *u_info) { -- 2.14.4
[PATCH v3 0/7] x86/ptrace: regset access to the GDT and LDT
From: "H. Peter Anvin" Give a debugger access to the visible part of the GDT and LDT. This allows a debugger to find out what a particular segment descriptor corresponds to; e.g. if %cs is 16, 32, or 64 bits. v3: Requalify LDT segments for selectors that have actually changed. v2: Add missing change to elf.h for the very last patch. Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger Cc: H. Peter Anvin arch/x86/Kconfig | 4 + arch/x86/include/asm/desc.h| 24 +++- arch/x86/include/asm/ldt.h | 16 +++ arch/x86/include/asm/segment.h | 10 ++ arch/x86/kernel/Makefile | 3 +- arch/x86/kernel/ldt.c | 283 - arch/x86/kernel/ptrace.c | 103 ++- arch/x86/kernel/tls.c | 102 +-- arch/x86/kernel/tls.h | 8 +- include/uapi/linux/elf.h | 2 + 10 files changed, 413 insertions(+), 142 deletions(-)
[PATCH v3 7/7] x86/ldt,ptrace: provide regset access to the LDT
From: "H. Peter Anvin" Provide ptrace/regset access to the LDT, if one exists. This interface provides both read and write access. The write code is unified with modify_ldt(); the read code doesn't have enough similarity so it has been kept made separate. Signed-off-by: H. Peter Anvin (Intel) Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger --- arch/x86/include/asm/desc.h | 2 +- arch/x86/include/asm/ldt.h | 16 arch/x86/kernel/ldt.c | 209 +++- arch/x86/kernel/ptrace.c| 20 + include/uapi/linux/elf.h| 1 + 5 files changed, 204 insertions(+), 44 deletions(-) create mode 100644 arch/x86/include/asm/ldt.h diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index 5e69f993c9ff..12a759d76314 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -3,7 +3,7 @@ #define _ASM_X86_DESC_H #include -#include +#include #include #include #include diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h new file mode 100644 index ..302ec2d6d45d --- /dev/null +++ b/arch/x86/include/asm/ldt.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Ptrace interface to the LDT + * + */ + +#ifndef _ARCH_X86_INCLUDE_ASM_LDT_H + +#include +#include + +extern user_regset_active_fn regset_ldt_active; +extern user_regset_get_fn regset_ldt_get; +extern user_regset_set_fn regset_ldt_set; + +#endif /* _ARCH_X86_INCLUDE_ASM_LDT_H */ diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 601d24268a99..e80dfde1f82f 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -392,53 +392,39 @@ static int read_default_ldt(void __user *ptr, unsigned long bytecount) return bytecount; } -static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) + +static int do_write_ldt(const struct task_struct *target, + const void *kbuf, const void __user *ubuf, + unsigned long bytecount, unsigned int index, + bool oldmode) { - struct mm_struct *mm = current->mm; + struct mm_struct *mm = target->mm; struct ldt_struct *new_ldt, *old_ldt; - unsigned int old_nr_entries, new_nr_entries; + unsigned int count, old_nr_entries, new_nr_entries; struct user_desc ldt_info; + const struct user_desc *kptr = kbuf; + const struct user_desc __user *uptr = ubuf; struct desc_struct ldt; + unsigned short first_index, last_index; int error; - error = -EINVAL; - if (bytecount != sizeof(ldt_info)) - goto out; - error = -EFAULT; - if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info))) - goto out; + if (bytecount % sizeof(ldt_info) || + bytecount >= LDT_ENTRY_SIZE*LDT_ENTRIES) + return -EINVAL; - error = -EINVAL; - if (ldt_info.entry_number >= LDT_ENTRIES) - goto out; - if (ldt_info.contents == 3) { - if (oldmode) - goto out; - if (ldt_info.seg_not_present == 0) - goto out; - } + count = bytecount/sizeof(ldt_info); + if (index >= LDT_ENTRIES || index + count > LDT_ENTRIES) + return -EINVAL; - if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) || - LDT_empty(&ldt_info)) { - /* The user wants to clear the entry. */ - memset(&ldt, 0, sizeof(ldt)); - } else { - if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) { - error = -EINVAL; - goto out; - } - - fill_ldt(&ldt, &ldt_info); - if (oldmode) - ldt.avl = 0; - } + first_index = index; + last_index = index + count - 1; if (down_write_killable(&mm->context.ldt_usr_sem)) return -EINTR; - old_ldt = mm->context.ldt; + old_ldt= mm->context.ldt; old_nr_entries = old_ldt ? 
old_ldt->nr_entries : 0; - new_nr_entries = max(ldt_info.entry_number + 1, old_nr_entries); + new_nr_entries = max(index + count, old_nr_entries); error = -ENOMEM; new_ldt = alloc_ldt_struct(new_nr_entries); @@ -446,9 +432,46 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) goto out_unlock; if (old_ldt) - memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * LDT_ENTRY_SIZE); + memcpy(new_ldt->entries, old_ldt->entries, + old_nr_entries * LDT_ENTRY_SIZE); + + while (count--) { + error = -EFAULT; +
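For completeness, a sketch of the write side from a tracer's point of view. set_ldt_entry0() is a hypothetical helper, NT_X86_LDT is a guarded placeholder (the real value comes from the elf.h hunk of this patch), and the exact position/count semantics follow the do_write_ldt() unification above:

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <asm/ldt.h>
#include <elf.h>

#ifndef NT_X86_LDT
# define NT_X86_LDT 0x204	/* placeholder; see the elf.h change in this patch */
#endif

/* Install a 32-bit flat data segment at LDT slot 0 of the tracee. */
static long set_ldt_entry0(pid_t pid)
{
	struct user_desc d = {
		.entry_number	= 0,
		.base_addr	= 0,
		.limit		= 0xfffff,
		.seg_32bit	= 1,
		.contents	= 0,		/* data, expand-up */
		.limit_in_pages	= 1,
		.useable	= 1,
	};
	struct iovec iov = { .iov_base = &d, .iov_len = sizeof(d) };

	return ptrace(PTRACE_SETREGSET, pid, NT_X86_LDT, &iov);
}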
[PATCH v3 2/7] x86/ldt: use a common value for read_default_ldt()
From: "H. Peter Anvin" There is no point in having two different sizes for the "default ldt"; a concept which is obsolete anyway. Since this is kernel-dependent and not user-space dependent, a 32-bit app needs to be able to accept the 64-bit value anyway, so use that value, which is the larger of the two. Signed-off-by: H. Peter Anvin (Intel) Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger --- arch/x86/kernel/ldt.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 18e9f4c0633d..601d24268a99 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -383,12 +383,8 @@ static int read_ldt(void __user *ptr, unsigned long bytecount) static int read_default_ldt(void __user *ptr, unsigned long bytecount) { - /* CHECKME: Can we use _one_ random number ? */ -#ifdef CONFIG_X86_32 - unsigned long size = 5 * sizeof(struct desc_struct); -#else - unsigned long size = 128; -#endif + const unsigned long size = 128; + if (bytecount > size) bytecount = size; if (clear_user(ptr, bytecount)) -- 2.14.4
[PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()
From: "H. Peter Anvin" It is not only %ds and %es which contain cached user descriptor information, %fs and %gs do as well. To make sure we don't do something stupid that will affect processes which wouldn't want this requalification, be more restrictive about which selector numbers will be requalified: they need to be LDT selectors (which by definition are never null), have an RPL of 3 (always the case in user space unless null), and match the updated descriptor. The infrastructure is set up to allow a range of descriptors; this will be used in a subsequent patch. Signed-off-by: H. Peter Anvin (Intel) Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Chang S. Bae Cc: Markus T. Metzger --- arch/x86/kernel/ldt.c | 70 +++ 1 file changed, 54 insertions(+), 16 deletions(-) diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index c9b14020f4dd..18e9f4c0633d 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -29,36 +29,68 @@ #include #include -static void refresh_ldt_segments(void) -{ +struct flush_ldt_info { + struct mm_struct *mm; + unsigned short first_desc; + unsigned short last_desc; +}; + #ifdef CONFIG_X86_64 + +static inline bool +need_requalify(unsigned short sel, const struct flush_ldt_info *info) +{ + /* Must be an LDT segment descriptor with an RPL of 3 */ + if ((sel & (SEGMENT_TI_MASK|SEGMENT_RPL_MASK)) != (SEGMENT_LDT|3)) + return false; + + return sel >= info->first_desc && sel <= info->last_desc; +} + +static void refresh_ldt_segments(const struct flush_ldt_info *info) +{ unsigned short sel; /* -* Make sure that the cached DS and ES descriptors match the updated -* LDT. +* Make sure that the cached DS/ES/FS/GS descriptors +* match the updated LDT, if the specific selectors point +* to LDT entries that have changed. */ savesegment(ds, sel); - if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT) + if (need_requalify(sel, info)) loadsegment(ds, sel); savesegment(es, sel); - if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT) + if (need_requalify(sel, info)) loadsegment(es, sel); -#endif + + savesegment(fs, sel); + if (need_requalify(sel, info)) + loadsegment(fs, sel); + + savesegment(gs, sel); + if (need_requalify(sel, info)) + load_gs_index(sel); } +#else +/* On 32 bits, entry_32.S takes care of this on kernel exit */ +static void refresh_ldt_segments(const struct flush_ldt_info *info) +{ + (void)info; +} +#endif + /* context.lock is held by the task which issued the smp function call */ -static void flush_ldt(void *__mm) +static void flush_ldt(void *_info) { - struct mm_struct *mm = __mm; + const struct flush_ldt_info *info = _info; - if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm) + if (this_cpu_read(cpu_tlbstate.loaded_mm) != info->mm) return; - load_mm_ldt(mm); - - refresh_ldt_segments(); + load_mm_ldt(info->mm); + refresh_ldt_segments(info); } /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */ @@ -223,15 +255,21 @@ static void finalize_ldt_struct(struct ldt_struct *ldt) paravirt_alloc_ldt(ldt->entries, ldt->nr_entries); } -static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt) +static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt, + unsigned short first_index, unsigned short last_index) { + struct flush_ldt_info info; + mutex_lock(&mm->context.lock); /* Synchronizes with READ_ONCE in load_mm_ldt. */ smp_store_release(&mm->context.ldt, ldt); /* Activate the LDT for all CPUs using currents mm. 
*/ - on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true); + info.mm = mm; + info.first_desc = (first_index << 3)|SEGMENT_LDT|3; + info.last_desc = (last_index << 3)|SEGMENT_LDT|3; + on_each_cpu_mask(mm_cpumask(mm), flush_ldt, &info, true); mutex_unlock(&mm->context.lock); } @@ -436,7 +474,7 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) goto out_unlock; } - install_ldt(mm, new_ldt); + install_ldt(mm, new_ldt, ldt_info.entry_number, ldt_info.entry_number); free_ldt_struct(old_ldt); error = 0; -- 2.14.4
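The first_desc/last_desc values built in install_ldt() above are ordinary selectors; for reference, the encoding works out as in this small sketch (LDT_SELECTOR() is only for illustration, not a kernel macro):

/*
 * Selector layout: (index << 3) | TI | RPL, where SEGMENT_LDT is the
 * table-indicator bit (0x4) and RPL 3 means user mode.
 *
 *   LDT index 0 -> (0 << 3) | 0x4 | 3 = 0x07
 *   LDT index 1 -> (1 << 3) | 0x4 | 3 = 0x0f
 *   LDT index 2 -> (2 << 3) | 0x4 | 3 = 0x17
 */
#define LDT_SELECTOR(idx)	(((idx) << 3) | 0x4 | 0x3)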
[PATCH 4/5] arch/parisc, termios: use <asm-generic/termbits.h>
The PARISC definition of termbits.h is almost identical to the generic one, so use the generic one and only redefine a handful of constants. Signed-off-by: H. Peter Anvin (Intel) Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Kate Stewart Cc: Thomas Gleixner Cc: Philippe Ombredanne Cc: Greg Kroah-Hartman Cc: Cc: Greg Kroah-Hartman Cc: Jiri Slaby --- arch/parisc/include/uapi/asm/termbits.h | 197 +--- 1 file changed, 4 insertions(+), 193 deletions(-) diff --git a/arch/parisc/include/uapi/asm/termbits.h b/arch/parisc/include/uapi/asm/termbits.h index 40e920f8d683..28ce9067f2e0 100644 --- a/arch/parisc/include/uapi/asm/termbits.h +++ b/arch/parisc/include/uapi/asm/termbits.h @@ -2,201 +2,12 @@ #ifndef __ARCH_PARISC_TERMBITS_H__ #define __ARCH_PARISC_TERMBITS_H__ -#include +#include -typedef unsigned char cc_t; -typedef unsigned int speed_t; -typedef unsigned int tcflag_t; - -#define NCCS 19 -struct termios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ -}; - -struct termios2 { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -struct ktermios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -/* c_cc characters */ -#define VINTR 0 -#define VQUIT 1 -#define VERASE 2 -#define VKILL 3 -#define VEOF 4 -#define VTIME 5 -#define VMIN 6 -#define VSWTC 7 -#define VSTART 8 -#define VSTOP 9 -#define VSUSP 10 -#define VEOL 11 -#define VREPRINT 12 -#define VDISCARD 13 -#define VWERASE 14 -#define VLNEXT 15 -#define VEOL2 16 - - -/* c_iflag bits */ -#define IGNBRK 001 -#define BRKINT 002 -#define IGNPAR 004 -#define PARMRK 010 -#define INPCK 020 -#define ISTRIP 040 -#define INLCR 100 -#define IGNCR 200 -#define ICRNL 400 -#define IUCLC 0001000 -#define IXON 0002000 -#define IXANY 0004000 -#define IXOFF 001 +/* c_iflag bits that differ from the generic ABI */ +#undef IMAXBEL #define IMAXBEL004 +#undef IUTF8 #define IUTF8 010 -/* c_oflag bits */ -#define OPOST 001 -#define OLCUC 002 -#define ONLCR 004 -#define OCRNL 010 -#define ONOCR 020 -#define ONLRET 040 -#define OFILL 100 -#define OFDEL 200 -#define NLDLY 400 -#define NL0 000 -#define NL1 400 -#define CRDLY 0003000 -#define CR0 000 -#define CR1 0001000 -#define CR2 0002000 -#define CR3 0003000 -#define TABDLY 0014000 -#define TAB0 000 -#define TAB1 0004000 -#define TAB2 001 -#define TAB3 0014000 -#define XTABS0014000 -#define BSDLY 002 -#define BS0 000 -#define BS1 002 -#define VTDLY 004 -#define VT0 000 -#define VT1 004 -#define FFDLY 010 -#define FF0 000 -#define FF1 010 - -/* c_cflag bit meaning */ -#define CBAUD 0010017 -#define B0 000 /* hang up */ -#define B50001 -#define B75002 -#define B110 003 -#define B134 004 -#define B150 005 -#define B200 006 -#define B300 007 -#define B600 010 -#define B1200 011 -#define B1800 012 -#define B2400 013 -#define B4800 014 -#define B9600 
015 -#define B19200 016 -#define B38400 017 -#define EXTA B19200 -#define EXTB B38400 -#define CSIZE 060 -#define CS5 000 -#define CS6 020 -#define CS7 040 -#define CS8 060 -#define CSTOPB 100 -#define CREAD 200 -#define PARENB 400 -#define PARODD 0001000 -#define HUPCL 0002000 -#define CLOCAL 0004000 -#define CBAUDEX 001 -#defineBOTHER 001 -#defineB57600 0010001 -#define B115200 0010002 -#define B230400 0010003 -#define B460800 0010004 -#define B50 0010005 -#define B5760
[PATCH 1/5] asm-generic, termios: add alias constants from MIPS
Some architectures, in this case MIPS, need a couple of legacy alias constants for bits. There really is no reason why we can't define them generically for all architectures. Signed-off-by: H. Peter Anvin (Intel) Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Cc: Jiri Slaby linux-kernel@vger.kernel.org (open list) --- include/uapi/asm-generic/termbits.h | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/include/uapi/asm-generic/termbits.h b/include/uapi/asm-generic/termbits.h index 2fbaf9ae89dd..96ae175eec5b 100644 --- a/include/uapi/asm-generic/termbits.h +++ b/include/uapi/asm-generic/termbits.h @@ -8,7 +8,10 @@ typedef unsigned char cc_t; typedef unsigned int speed_t; typedef unsigned int tcflag_t; -#define NCCS 19 +#ifndef NCCS +# define NCCS 19 +#endif + struct termios { tcflag_t c_iflag; /* input mode flags */ tcflag_t c_oflag; /* output mode flags */ @@ -49,6 +52,7 @@ struct ktermios { #define VTIME 5 #define VMIN 6 #define VSWTC 7 +#define VSWTCH VSWTC #define VSTART 8 #define VSTOP 9 #define VSUSP 10 @@ -173,6 +177,7 @@ struct ktermios { #define ECHONL 100 #define NOFLSH 200 #define TOSTOP 400 +#define ITOSTOP TOSTOP #define ECHOCTL0001000 #define ECHOPRT0002000 #define ECHOKE 0004000 -- 2.14.4
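The #ifndef NCCS guard is what makes the per-arch patches later in this series possible: an architecture with a larger c_cc array defines NCCS before pulling in the generic header, then re-defines only the constants that differ. Roughly, the pattern looks like this sketch:

/* arch/<arch>/include/uapi/asm/termbits.h -- sketch of the override pattern */
#define NCCS 23				/* must precede the generic header */
#include <asm-generic/termbits.h>

/* then override only what differs from the generic ABI */
#undef VMIN
#define VMIN	4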
[PATCH 2/5] arch/ia64, termios: use <asm-generic/termbits.h>
The IA64 definition of termbits.h is identical to the generic. Signed-off-by: H. Peter Anvin (Intel) Cc: Tony Luck Cc: Fenghua Yu Cc: Kate Stewart Cc: Philippe Ombredanne Cc: Greg Kroah-Hartman Cc: Thomas Gleixner Cc: Cc: Greg Kroah-Hartman CC: Jiri Slaby --- arch/ia64/include/uapi/asm/termbits.h | 210 +- 1 file changed, 1 insertion(+), 209 deletions(-) diff --git a/arch/ia64/include/uapi/asm/termbits.h b/arch/ia64/include/uapi/asm/termbits.h index 000a1a297c75..3935b106de79 100644 --- a/arch/ia64/include/uapi/asm/termbits.h +++ b/arch/ia64/include/uapi/asm/termbits.h @@ -1,209 +1 @@ -/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ -#ifndef _ASM_IA64_TERMBITS_H -#define _ASM_IA64_TERMBITS_H - -/* - * Based on . - * - * Modified 1999 - * David Mosberger-Tang , Hewlett-Packard Co - * - * 99/01/28Added new baudrates - */ - -#include - -typedef unsigned char cc_t; -typedef unsigned int speed_t; -typedef unsigned int tcflag_t; - -#define NCCS 19 -struct termios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ -}; - -struct termios2 { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -struct ktermios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -/* c_cc characters */ -#define VINTR 0 -#define VQUIT 1 -#define VERASE 2 -#define VKILL 3 -#define VEOF 4 -#define VTIME 5 -#define VMIN 6 -#define VSWTC 7 -#define VSTART 8 -#define VSTOP 9 -#define VSUSP 10 -#define VEOL 11 -#define VREPRINT 12 -#define VDISCARD 13 -#define VWERASE 14 -#define VLNEXT 15 -#define VEOL2 16 - -/* c_iflag bits */ -#define IGNBRK 001 -#define BRKINT 002 -#define IGNPAR 004 -#define PARMRK 010 -#define INPCK 020 -#define ISTRIP 040 -#define INLCR 100 -#define IGNCR 200 -#define ICRNL 400 -#define IUCLC 0001000 -#define IXON 0002000 -#define IXANY 0004000 -#define IXOFF 001 -#define IMAXBEL002 -#define IUTF8 004 - -/* c_oflag bits */ -#define OPOST 001 -#define OLCUC 002 -#define ONLCR 004 -#define OCRNL 010 -#define ONOCR 020 -#define ONLRET 040 -#define OFILL 100 -#define OFDEL 200 -#define NLDLY 400 -#define NL0 000 -#define NL1 400 -#define CRDLY 0003000 -#define CR0 000 -#define CR1 0001000 -#define CR2 0002000 -#define CR3 0003000 -#define TABDLY 0014000 -#define TAB0 000 -#define TAB1 0004000 -#define TAB2 001 -#define TAB3 0014000 -#define XTABS0014000 -#define BSDLY 002 -#define BS0 000 -#define BS1 002 -#define VTDLY 004 -#define VT0 000 -#define VT1 004 -#define FFDLY 010 -#define FF0 000 -#define FF1 010 - -/* c_cflag bit meaning */ -#define CBAUD 0010017 -#define B0000 /* hang up */ -#define B50 001 -#define B75 002 -#define B110 003 -#define B134 004 -#define B150 005 -#define B200 006 -#define B300 007 -#define B600 010 -#define B1200 011 -#define B1800 012 -#define B2400 013 -#define B4800 014 -#define B9600 015 -#define 
B19200016 -#define B38400017 -#define EXTA B19200 -#define EXTB B38400 -#define CSIZE 060 -#define CS5 000 -#define CS6 020 -#define CS7 040 -#define CS8 060 -#define CSTOPB 100 -#define CREAD 200 -#define PARENB 400 -#define PARODD 0001000 -#define HUPCL 0002000 -#define CLOCAL 0004000 -#define CBAUDEX 001 -#defineBOTHER 001 -#defineB57600 0010001 -#define B115200 0010002 -#define B230400 0010003 -#define B460800 0010004 -#define B50 0010005 -#define B576000 0010006 -#define B921600 0010007
[PATCH 5/5] arch/xtensa, termios: use <asm-generic/termbits.h>
The Xtensa definition of termbits.h is identical to the generic one. Signed-off-by: H. Peter Anvin (Intel) Cc: Chris Zankel Cc: Max Filippov Cc: Thomas Gleixner Cc: Greg Kroah-Hartman Cc: Philippe Ombredanne Cc: Kate Stewart Cc: Cc: Greg Kroah-Hartman Cc: Jiri Slaby --- arch/xtensa/include/uapi/asm/termbits.h | 222 +--- 1 file changed, 1 insertion(+), 221 deletions(-) diff --git a/arch/xtensa/include/uapi/asm/termbits.h b/arch/xtensa/include/uapi/asm/termbits.h index d4206a7c5138..3935b106de79 100644 --- a/arch/xtensa/include/uapi/asm/termbits.h +++ b/arch/xtensa/include/uapi/asm/termbits.h @@ -1,221 +1 @@ -/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ -/* - * include/asm-xtensa/termbits.h - * - * Copied from SH. - * - * This file is subject to the terms and conditions of the GNU General Public - * License. See the file "COPYING" in the main directory of this archive - * for more details. - * - * Copyright (C) 2001 - 2005 Tensilica Inc. - */ - -#ifndef _XTENSA_TERMBITS_H -#define _XTENSA_TERMBITS_H - - -#include - -typedef unsigned char cc_t; -typedef unsigned int speed_t; -typedef unsigned int tcflag_t; - -#define NCCS 19 -struct termios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ -}; - -struct termios2 { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -struct ktermios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -/* c_cc characters */ - -#define VINTR 0 -#define VQUIT 1 -#define VERASE 2 -#define VKILL 3 -#define VEOF 4 -#define VTIME 5 -#define VMIN 6 -#define VSWTC 7 -#define VSTART 8 -#define VSTOP 9 -#define VSUSP 10 -#define VEOL 11 -#define VREPRINT 12 -#define VDISCARD 13 -#define VWERASE 14 -#define VLNEXT 15 -#define VEOL2 16 - -/* c_iflag bits */ - -#define IGNBRK 001 -#define BRKINT 002 -#define IGNPAR 004 -#define PARMRK 010 -#define INPCK 020 -#define ISTRIP 040 -#define INLCR 100 -#define IGNCR 200 -#define ICRNL 400 -#define IUCLC 0001000 -#define IXON 0002000 -#define IXANY 0004000 -#define IXOFF 001 -#define IMAXBEL002 -#define IUTF8 004 - -/* c_oflag bits */ - -#define OPOST 001 -#define OLCUC 002 -#define ONLCR 004 -#define OCRNL 010 -#define ONOCR 020 -#define ONLRET 040 -#define OFILL 100 -#define OFDEL 200 -#define NLDLY 400 -#define NL0 000 -#define NL1 400 -#define CRDLY 0003000 -#define CR0 000 -#define CR1 0001000 -#define CR2 0002000 -#define CR3 0003000 -#define TABDLY 0014000 -#define TAB0 000 -#define TAB1 0004000 -#define TAB2 001 -#define TAB3 0014000 -#define XTABS0014000 -#define BSDLY 002 -#define BS0 000 -#define BS1 002 -#define VTDLY 004 -#define VT0 000 -#define VT1 004 -#define FFDLY 010 -#define FF0 000 -#define FF1 010 - -/* c_cflag bit meaning */ - -#define CBAUD 0010017 -#define B0000 /* hang up */ -#define B50 001 -#define B75 002 -#define B110 
003 -#define B134 004 -#define B150 005 -#define B200 006 -#define B300 007 -#define B600 010 -#define B1200 011 -#define B1800 012 -#define B2400 013 -#define B4800 014 -#define B9600 015 -#define B19200016 -#define B38400017 -#define EXTA B19200 -#define EXTB B38400 -#define CSIZE 060 -#define CS5 000 -#define CS6 020 -#define CS7 040 -#define CS8 060 -#define CSTOPB 100 -#define CREAD 200 -#define PARENB 400 -#define PARODD 0001000 -#define HUPCL 0002000 -#define CLOCAL 0004000 -#define CBAUDEX 001 -#define
[PATCH 0/5] termios: remove arch redundancy in <asm/termbits.h>
is one of those files which define an ABI. Some were made different due to the desire to be compatible with legacy architectures, others were mostly direct copies of the i386 definitions, which are now in asm-generic. This folds the IA64, MIPS, PA-RISC, and Xtensa implementations into the generic one. IA64 and Xtensa are identical, MIPS and PA-RISC are trivially different and just need a handful of constants redefined. has a few very minor adjustments to allow this. arch/ia64/include/uapi/asm/termbits.h | 210 +- arch/mips/include/uapi/asm/ioctls.h | 2 + arch/mips/include/uapi/asm/termbits.h | 213 ++ arch/parisc/include/uapi/asm/termbits.h | 197 +--- arch/xtensa/include/uapi/asm/termbits.h | 222 +--- include/uapi/asm-generic/termbits.h | 7 +- 6 files changed, 27 insertions(+), 824 deletions(-) Signed-off-by: H. Peter Anvin (Intel) Cc: "James E.J. Bottomley" Cc: Arnd Bergmann Cc: Chris Zankel Cc: Fenghua Yu Cc: Greg Kroah-Hartman Cc: Helge Deller Cc: James Hogan Cc: Jiri Slaby Cc: Kate Stewart Cc: Max Filippov Cc: Paul Burton Cc: Philippe Ombredanne Cc: Ralf Baechle Cc: Thomas Gleixner Cc: Tony Luck Cc: Cc: Cc: Cc:
[PATCH 3/5] arch/mips, termios: use <asm-generic/termbits.h>
The MIPS definition of termbits.h is almost identical to the generic one, so use the generic one and only redefine a handful of constants. Move TIOCSER_TEMT to ioctls.h where it lives for all other architectures. Signed-off-by: H. Peter Anvin (Intel) Cc: Ralf Baechle Cc: Paul Burton Cc: James Hogan Cc: Philippe Ombredanne Cc: Greg Kroah-Hartman Cc: Kate Stewart Cc: Thomas Gleixner Cc: Cc: Greg Kroah-Hartman Cc: Jiri Slaby --- arch/mips/include/uapi/asm/ioctls.h | 2 + arch/mips/include/uapi/asm/termbits.h | 213 +++--- 2 files changed, 15 insertions(+), 200 deletions(-) diff --git a/arch/mips/include/uapi/asm/ioctls.h b/arch/mips/include/uapi/asm/ioctls.h index 890245a9f0c4..a4e11e1438b5 100644 --- a/arch/mips/include/uapi/asm/ioctls.h +++ b/arch/mips/include/uapi/asm/ioctls.h @@ -114,4 +114,6 @@ #define TIOCMIWAIT 0x5491 /* wait for a change on serial input line(s) */ #define TIOCGICOUNT0x5492 /* read serial port inline interrupt counts */ +#define TIOCSER_TEMT 0x01/* Transmitter physically empty */ + #endif /* __ASM_IOCTLS_H */ diff --git a/arch/mips/include/uapi/asm/termbits.h b/arch/mips/include/uapi/asm/termbits.h index dfeffba729b7..a08fa0a697f5 100644 --- a/arch/mips/include/uapi/asm/termbits.h +++ b/arch/mips/include/uapi/asm/termbits.h @@ -11,218 +11,31 @@ #ifndef _ASM_TERMBITS_H #define _ASM_TERMBITS_H -#include +#define NCCS 23 +#include -typedef unsigned char cc_t; -typedef unsigned int speed_t; -typedef unsigned int tcflag_t; - -/* - * The ABI says nothing about NCC but seems to use NCCS as - * replacement for it in struct termio - */ -#define NCCS 23 -struct termios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ -}; - -struct termios2 { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -struct ktermios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_line;/* line discipline */ - cc_t c_cc[NCCS];/* control characters */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -/* c_cc characters */ -#define VINTR 0 /* Interrupt character [ISIG]. */ -#define VQUIT 1 /* Quit character [ISIG]. */ -#define VERASE 2 /* Erase character [ICANON]. */ -#define VKILL 3 /* Kill-line character [ICANON]. */ +/* c_cc characters that differ from the generic ABI */ +#undef VMIN #define VMIN4 /* Minimum number of bytes read at once [!ICANON]. */ -#define VTIME 5 /* Time-out value (tenths of a second) [!ICANON]. */ +#undef VEOL2 #define VEOL2 6 /* Second EOL character [ICANON]. */ -#define VSWTC 7 /* ??? */ -#define VSWTCH VSWTC -#define VSTART 8 /* Start (X-ON) character [IXON, IXOFF]. */ -#define VSTOP 9 /* Stop (X-OFF) character [IXON, IXOFF]. */ -#define VSUSP 10 /* Suspend character [ISIG]. */ -#if 0 -/* - * VDSUSP is not supported - */ -#define VDSUSP 11 /* Delayed suspend character [ISIG]. */ -#endif -#define VREPRINT 12 /* Reprint-line character [ICANON]. */ -#define VDISCARD 13 /* Discard character [IEXTEN]. 
*/ -#define VWERASE14 /* Word-erase character [ICANON]. */ -#define VLNEXT 15 /* Literal-next character [IEXTEN]. */ +#undef VEOF #define VEOF 16 /* End-of-file character [ICANON]. */ +#undef VEOL #define VEOL 17 /* End-of-line character [ICANON]. */ -/* c_iflag bits */ -#define IGNBRK 001 /* Ignore break condition. */ -#define BRKINT 002 /* Signal interrupt on break. */ -#define IGNPAR 004 /* Ignore characters with parity errors. */ -#define
[tip:x86/pti] sched/smt: Make sched_smt_present track topology
Commit-ID: c5511d03ec090980732e929c318a7a6374b5550e Gitweb: https://git.kernel.org/tip/c5511d03ec090980732e929c318a7a6374b5550e Author: Peter Zijlstra (Intel) AuthorDate: Sun, 25 Nov 2018 19:33:36 +0100 Committer: Thomas Gleixner CommitDate: Wed, 28 Nov 2018 11:57:06 +0100 sched/smt: Make sched_smt_present track topology Currently the 'sched_smt_present' static key is enabled when at CPU bringup SMT topology is observed, but it is never disabled. However there is demand to also disable the key when the topology changes such that there is no SMT present anymore. Implement this by making the key count the number of cores that have SMT enabled. In particular, the SMT topology bits are set before interrrupts are enabled and similarly, are cleared after interrupts are disabled for the last time and the CPU dies. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Cc: Andy Lutomirski Cc: Linus Torvalds Cc: Jiri Kosina Cc: Tom Lendacky Cc: Josh Poimboeuf Cc: Andrea Arcangeli Cc: David Woodhouse Cc: Tim Chen Cc: Andi Kleen Cc: Dave Hansen Cc: Casey Schaufler Cc: Asit Mallick Cc: Arjan van de Ven Cc: Jon Masters Cc: Waiman Long Cc: Greg KH Cc: Dave Stewart Cc: Kees Cook Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185004.246110...@linutronix.de --- kernel/sched/core.c | 19 +++ 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 091e089063be..6fedf3a98581 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5738,15 +5738,10 @@ int sched_cpu_activate(unsigned int cpu) #ifdef CONFIG_SCHED_SMT /* -* The sched_smt_present static key needs to be evaluated on every -* hotplug event because at boot time SMT might be disabled when -* the number of booted CPUs is limited. -* -* If then later a sibling gets hotplugged, then the key would stay -* off and SMT scheduling would never be functional. +* When going up, increment the number of cores with SMT present. */ - if (cpumask_weight(cpu_smt_mask(cpu)) > 1) - static_branch_enable_cpuslocked(&sched_smt_present); + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) + static_branch_inc_cpuslocked(&sched_smt_present); #endif set_cpu_active(cpu, true); @@ -5790,6 +5785,14 @@ int sched_cpu_deactivate(unsigned int cpu) */ synchronize_rcu_mult(call_rcu, call_rcu_sched); +#ifdef CONFIG_SCHED_SMT + /* +* When going down, decrement the number of cores with SMT present. +*/ + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) + static_branch_dec_cpuslocked(&sched_smt_present); +#endif + if (!sched_smp_initialized) return 0;
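For context on why the key now counts cores instead of being latched on: readers of the key elsewhere in the tree only ask "is SMT active anywhere right now?", roughly as in the sketch below, which mirrors the sched_smt_active() helper and is shown purely for illustration:

#ifdef CONFIG_SCHED_SMT
DECLARE_STATIC_KEY_FALSE(sched_smt_present);

static inline bool sched_smt_active(void)
{
	/* True while at least one core has more than one online sibling */
	return static_branch_likely(&sched_smt_present);
}
#else
static inline bool sched_smt_active(void) { return false; }
#endif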
[tip:x86/boot] x86/kaslr, ACPI/NUMA: Fix KASLR build error
Commit-ID: 9d94e8b1d4f94a3c4cee5ad11a1be460cd070839 Gitweb: https://git.kernel.org/tip/9d94e8b1d4f94a3c4cee5ad11a1be460cd070839 Author: Peter Zijlstra (Intel) AuthorDate: Wed, 3 Oct 2018 14:41:27 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Oct 2018 12:30:25 +0200 x86/kaslr, ACPI/NUMA: Fix KASLR build error There is no point in trying to compile KASLR-specific code when there is no KASLR. [ bp: Move the whole crap into kaslr.c and make rand_mem_physical_padding static. Make kaslr_check_padding() weak to avoid build breakage on other architectures. ] Reported-by: Naresh Kamboju Reported-by: Mark Brown Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Borislav Petkov Cc: Cc: Cc: Cc: Cc: Cc: Link: http://lkml.kernel.org/r/20181003123402.ga15...@hirez.programming.kicks-ass.net --- arch/x86/include/asm/setup.h | 2 -- arch/x86/mm/kaslr.c | 19 ++- drivers/acpi/numa.c | 17 + 3 files changed, 23 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h index 65a5bf8f6aba..ae13bc974416 100644 --- a/arch/x86/include/asm/setup.h +++ b/arch/x86/include/asm/setup.h @@ -80,8 +80,6 @@ static inline unsigned long kaslr_offset(void) return (unsigned long)&_text - __START_KERNEL; } -extern int rand_mem_physical_padding; - /* * Do NOT EVER look at the BIOS memory size location. * It does not work on many machines. diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c index 00cf4cae38f5..b3471388288d 100644 --- a/arch/x86/mm/kaslr.c +++ b/arch/x86/mm/kaslr.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -40,7 +41,7 @@ */ static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE; -int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING; +static int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING; /* * Memory regions randomized by KASLR (except modules that use a separate logic * earlier during boot). The list is ordered based on virtual addresses. This @@ -70,6 +71,22 @@ static inline bool kaslr_memory_enabled(void) return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN); } +/* + * Check the padding size for KASLR is enough. 
+ */ +void __init kaslr_check_padding(void) +{ + u64 max_possible_phys, max_actual_phys, threshold; + + max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40); + max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40); + threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40); + + if (max_possible_phys > threshold) + pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n", + (max_possible_phys - max_actual_phys) >> 40); +} + static int __init rand_mem_physical_padding_setup(char *str) { int max_padding = (1 << (MAX_PHYSMEM_BITS - TB_SHIFT)) - 1; diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c index 3d69834c692f..ba62004f4d86 100644 --- a/drivers/acpi/numa.c +++ b/drivers/acpi/numa.c @@ -32,7 +32,6 @@ #include #include #include -#include static nodemask_t nodes_found_map = NODE_MASK_NONE; @@ -433,10 +432,12 @@ acpi_table_parse_srat(enum acpi_srat_type id, handler, max_entries); } +/* To be overridden by architectures */ +void __init __weak kaslr_check_padding(void) { } + int __init acpi_numa_init(void) { int cnt = 0; - u64 max_possible_phys, max_actual_phys, threshold; if (acpi_disabled) return -EINVAL; @@ -466,17 +467,9 @@ int __init acpi_numa_init(void) cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY, acpi_parse_memory_affinity, 0); - /* check the padding size for KASLR is enough. */ - if (parsed_numa_memblks && kaslr_enabled()) { - max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40); - max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40); - threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40); + if (parsed_numa_memblks) + kaslr_check_padding(); - if (max_possible_phys > threshold) { - pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n", - (max_possible_phys - max_actual_phys) >> 40); - } - } } /* SLIT: System Locality Information Table */
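A worked example of the check above, with made-up numbers. The padding unit is terabytes, matching the << 40 shifts, and 10 is shown as a typical CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING value:

/* Worked example of kaslr_check_padding(), illustrative values only. */
static unsigned long long suggested_padding_tb(void)
{
	unsigned long long max_actual_phys   = 2ULL  << 40;	/* 1.5 TB populated, rounded up to 2 TB */
	unsigned long long max_possible_phys = 16ULL << 40;	/* SRAT allows 16 TB after hot-add      */
	unsigned long long padding_tb        = 10;		/* 0xa, a typical Kconfig default       */
	unsigned long long threshold = max_actual_phys + (padding_tb << 40);	/* 12 TB */

	if (max_possible_phys > threshold)			/* 16 TB > 12 TB: the warning fires */
		return (max_possible_phys - max_actual_phys) >> 40;	/* suggests padding = 14 */
	return 0;
}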
[tip:perf/core] perf: Avoid horrible stack usage
Commit-ID: 86038c5ea81b519a8a1fcfcd5e4599aab0cdd119 Gitweb: http://git.kernel.org/tip/86038c5ea81b519a8a1fcfcd5e4599aab0cdd119 Author: Peter Zijlstra (Intel) AuthorDate: Tue, 16 Dec 2014 12:47:34 +0100 Committer: Ingo Molnar CommitDate: Wed, 14 Jan 2015 15:11:45 +0100 perf: Avoid horrible stack usage Both Linus (most recent) and Steve (a while ago) reported that perf related callbacks have massive stack bloat. The problem is that software events need a pt_regs in order to properly report the event location and unwind stack. And because we could not assume one was present we allocated one on stack and filled it with minimal bits required for operation. Now, pt_regs is quite large, so this is undesirable. Furthermore it turns out that most sites actually have a pt_regs pointer available, making this even more onerous, as the stack space is pointless waste. This patch addresses the problem by observing that software events have well defined nesting semantics, therefore we can use static per-cpu storage instead of on-stack. Linus made the further observation that all but the scheduler callers of perf_sw_event() have a pt_regs available, so we change the regular perf_sw_event() to require a valid pt_regs (where it used to be optional) and add perf_sw_event_sched() for the scheduler. We have a scheduler specific call instead of a more generic _noregs() like construct because we can assume non-recursion from the scheduler and thereby simplify the code further (_noregs would have to put the recursion context call inline in order to assertain which __perf_regs element to use). One last note on the implementation of perf_trace_buf_prepare(); we allow .regs = NULL for those cases where we already have a pt_regs pointer available and do not need another. Reported-by: Linus Torvalds Reported-by: Steven Rostedt Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Javi Merino Cc: Linus Torvalds Cc: Mathieu Desnoyers Cc: Oleg Nesterov Cc: Paul Mackerras Cc: Petr Mladek Cc: Steven Rostedt Cc: Tom Zanussi Cc: Vaibhav Nagarnaik Link: http://lkml.kernel.org/r/20141216115041.gw3...@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- include/linux/ftrace_event.h| 2 +- include/linux/perf_event.h | 28 +--- include/trace/ftrace.h | 7 --- kernel/events/core.c| 23 +-- kernel/sched/core.c | 2 +- kernel/trace/trace_event_perf.c | 4 +++- kernel/trace/trace_kprobe.c | 4 ++-- kernel/trace/trace_syscalls.c | 4 ++-- kernel/trace/trace_uprobe.c | 2 +- 9 files changed, 52 insertions(+), 24 deletions(-) diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h index 0bebb5c..d36f68b 100644 --- a/include/linux/ftrace_event.h +++ b/include/linux/ftrace_event.h @@ -595,7 +595,7 @@ extern int ftrace_profile_set_filter(struct perf_event *event, int event_id, char *filter_str); extern void ftrace_profile_free_filter(struct perf_event *event); extern void *perf_trace_buf_prepare(int size, unsigned short type, - struct pt_regs *regs, int *rctxp); + struct pt_regs **regs, int *rctxp); static inline void perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr, diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 4f7a61c..3a7bd80 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -665,6 +665,7 @@ static inline int is_software_event(struct perf_event *event) extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX]; +extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64); extern void __perf_sw_event(u32, u64, struct pt_regs 
*, u64); #ifndef perf_arch_fetch_caller_regs @@ -689,14 +690,25 @@ static inline void perf_fetch_caller_regs(struct pt_regs *regs) static __always_inline void perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr) { - struct pt_regs hot_regs; + if (static_key_false(&perf_swevent_enabled[event_id])) + __perf_sw_event(event_id, nr, regs, addr); +} + +DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]); +/* + * 'Special' version for the scheduler, it hard assumes no recursion, + * which is guaranteed by us not actually scheduling inside other swevents + * because those disable preemption. + */ +static __always_inline void +perf_sw_event_sched(u32 event_id, u64 nr, u64 addr) +{ if (static_key_false(&perf_swevent_enabled[event_id])) { - if (!regs) { - perf_fetch_caller_regs(&hot_regs); - regs = &hot_regs; - } - __perf_sw_event(event_id, nr, regs, addr); + struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]); + + perf_fetch_caller_regs(regs); + ___pe
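The __perf_regs[4] array introduced above has one slot per perf recursion context (task, softirq, hardirq, NMI), so a software event that interrupts another never scribbles over its caller's regs snapshot. A rough conceptual sketch with an invented helper name; the real indexing lives in the core.c and trace_event_perf.c hunks:

/*
 * Conceptual sketch only: rctx is the recursion-context level (0..3)
 * returned by perf_swevent_get_recursion_context(), and each level gets
 * its own per-cpu pt_regs slot.
 */
static void __sw_event_using_percpu_regs(u32 event_id, u64 nr, u64 addr, int rctx)
{
	struct pt_regs *regs = this_cpu_ptr(&__perf_regs[rctx]);

	perf_fetch_caller_regs(regs);
	___perf_sw_event(event_id, nr, regs, addr);
}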
[tip: sched/core] preempt/dynamic: Provide irqentry_exit_cond_resched() static call
The following commit has been merged into the sched/core branch of tip: Commit-ID: 40607ee97e4eec5655cc0f76a720bdc4c63a6434 Gitweb: https://git.kernel.org/tip/40607ee97e4eec5655cc0f76a720bdc4c63a6434 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:22 +01:00 Committer: Ingo Molnar CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00 preempt/dynamic: Provide irqentry_exit_cond_resched() static call Provide static call to control IRQ preemption (called in CONFIG_PREEMPT) so that we can override its behaviour when preempt= is overriden. Since the default behaviour is full preemption, its call is initialized to provide IRQ preemption when preempt= isn't passed. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20210118141223.123667-8-frede...@kernel.org --- include/linux/entry-common.h | 4 kernel/entry/common.c| 10 +- 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index a104b29..883acef 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -2,6 +2,7 @@ #ifndef __LINUX_ENTRYCOMMON_H #define __LINUX_ENTRYCOMMON_H +#include #include #include #include @@ -454,6 +455,9 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs); * Conditional reschedule with additional sanity checks. */ void irqentry_exit_cond_resched(void); +#ifdef CONFIG_PREEMPT_DYNAMIC +DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched); +#endif /** * irqentry_exit - Handle return from exception that used irqentry_enter() diff --git a/kernel/entry/common.c b/kernel/entry/common.c index f9d491b..f09cae3 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -385,6 +385,9 @@ void irqentry_exit_cond_resched(void) preempt_schedule_irq(); } } +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched); +#endif noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) { @@ -411,8 +414,13 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) } instrumentation_begin(); - if (IS_ENABLED(CONFIG_PREEMPTION)) + if (IS_ENABLED(CONFIG_PREEMPTION)) { +#ifdef CONFIG_PREEMT_DYNAMIC + static_call(irqentry_exit_cond_resched)(); +#else irqentry_exit_cond_resched(); +#endif + } /* Covers both tracing and lockdep */ trace_hardirqs_on(); instrumentation_end();
[tip: sched/core] preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
The following commit has been merged into the sched/core branch of tip: Commit-ID: 2c9a98d3bc808717ab63ad928a2b568967775388 Gitweb: https://git.kernel.org/tip/2c9a98d3bc808717ab63ad928a2b568967775388 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:21 +01:00 Committer: Ingo Molnar CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00 preempt/dynamic: Provide preempt_schedule[_notrace]() static calls Provide static calls to control preempt_schedule[_notrace]() (called in CONFIG_PREEMPT) so that we can override their behaviour when preempt= is overriden. Since the default behaviour is full preemption, both their calls are initialized to the arch provided wrapper, if any. [fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less dependent on x86 with __preempt_schedule_func] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20210118141223.123667-7-frede...@kernel.org --- arch/x86/include/asm/preempt.h | 34 + kernel/sched/core.c| 12 - 2 files changed, 38 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h index 69485ca..9b12dce 100644 --- a/arch/x86/include/asm/preempt.h +++ b/arch/x86/include/asm/preempt.h @@ -5,6 +5,7 @@ #include #include #include +#include DECLARE_PER_CPU(int, __preempt_count); @@ -103,16 +104,33 @@ static __always_inline bool should_resched(int preempt_offset) } #ifdef CONFIG_PREEMPTION - extern asmlinkage void preempt_schedule_thunk(void); -# define __preempt_schedule() \ - asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT) - extern asmlinkage void preempt_schedule(void); - extern asmlinkage void preempt_schedule_notrace_thunk(void); -# define __preempt_schedule_notrace() \ - asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT) +extern asmlinkage void preempt_schedule(void); +extern asmlinkage void preempt_schedule_thunk(void); + +#define __preempt_schedule_func preempt_schedule_thunk + +DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func); + +#define __preempt_schedule() \ +do { \ + __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule)); \ + asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \ +} while (0) + +extern asmlinkage void preempt_schedule_notrace(void); +extern asmlinkage void preempt_schedule_notrace_thunk(void); + +#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk + +DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func); + +#define __preempt_schedule_notrace() \ +do { \ + __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule_notrace)); \ + asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : ASM_CALL_CONSTRAINT); \ +} while (0) - extern asmlinkage void preempt_schedule_notrace(void); #endif #endif /* __ASM_PREEMPT_H */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f7c8fd8..880611c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5265,6 +5265,12 @@ asmlinkage __visible void __sched notrace preempt_schedule(void) NOKPROBE_SYMBOL(preempt_schedule); EXPORT_SYMBOL(preempt_schedule); +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func); +EXPORT_STATIC_CALL(preempt_schedule); +#endif + + /** * preempt_schedule_notrace - preempt_schedule called by tracing * @@ -5317,6 +5323,12 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void) } 
EXPORT_SYMBOL_GPL(preempt_schedule_notrace); +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func); +EXPORT_STATIC_CALL(preempt_schedule_notrace); +#endif + + #endif /* CONFIG_PREEMPTION */ /*
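The hunks above lean entirely on the generic static-call API; a hedged, self-contained sketch of that pattern follows (the my_* names are illustrative, not from the patch). A call site compiles to a direct call into the current target, and static_call_update() re-patches every site, which is what later lets preempt= swap preempt_schedule[_notrace]() for a NOP without adding a conditional branch.

#include <linux/static_call.h>

static int my_default_target(int x)
{
	return x + 1;
}

static int my_other_target(int x)
{
	return 2 * x;
}

/* Key plus trampoline; call sites become direct calls to the current target. */
DEFINE_STATIC_CALL(my_call, my_default_target);

static int my_caller(int x)
{
	/* no function pointer, no retpoline: a patched direct call */
	return static_call(my_call)(x);
}

static void my_switch_target(void)
{
	/* text-patches every static_call(my_call) site to the new target */
	static_call_update(my_call, my_other_target);
}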
[tip: sched/core] preempt/dynamic: Support dynamic preempt with preempt= boot option
The following commit has been merged into the sched/core branch of tip: Commit-ID: 826bfeb37bb4302ee6042f330c4c0c757152bdb8 Gitweb: https://git.kernel.org/tip/826bfeb37bb4302ee6042f330c4c0c757152bdb8 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:23 +01:00 Committer: Ingo Molnar CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00 preempt/dynamic: Support dynamic preempt with preempt= boot option Support the preempt= boot option and patch the static call sites accordingly. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20210118141223.123667-9-frede...@kernel.org --- kernel/sched/core.c | 68 +++- 1 file changed, 67 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 880611c..0c06717 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5328,9 +5328,75 @@ DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func); EXPORT_STATIC_CALL(preempt_schedule_notrace); #endif - #endif /* CONFIG_PREEMPTION */ +#ifdef CONFIG_PREEMPT_DYNAMIC + +#include + +/* + * SC:cond_resched + * SC:might_resched + * SC:preempt_schedule + * SC:preempt_schedule_notrace + * SC:irqentry_exit_cond_resched + * + * + * NONE: + * cond_resched <- __cond_resched + * might_resched <- RET0 + * preempt_schedule <- NOP + * preempt_schedule_notrace <- NOP + * irqentry_exit_cond_resched <- NOP + * + * VOLUNTARY: + * cond_resched <- __cond_resched + * might_resched <- __cond_resched + * preempt_schedule <- NOP + * preempt_schedule_notrace <- NOP + * irqentry_exit_cond_resched <- NOP + * + * FULL: + * cond_resched <- RET0 + * might_resched <- RET0 + * preempt_schedule <- preempt_schedule + * preempt_schedule_notrace <- preempt_schedule_notrace + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched + */ +static int __init setup_preempt_mode(char *str) +{ + if (!strcmp(str, "none")) { + static_call_update(cond_resched, __cond_resched); + static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL); + static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL); + static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL); + pr_info("Dynamic Preempt: %s\n", str); + } else if (!strcmp(str, "voluntary")) { + static_call_update(cond_resched, __cond_resched); + static_call_update(might_resched, __cond_resched); + static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL); + static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL); + static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL); + pr_info("Dynamic Preempt: %s\n", str); + } else if (!strcmp(str, "full")) { + static_call_update(cond_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(preempt_schedule, __preempt_schedule_func); + static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func); + static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched); + pr_info("Dynamic Preempt: %s\n", str); + } else { + pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default to full\n", str); + return 1; + } + return 0; +} +__setup("preempt=", setup_preempt_mode); + +#endif /* CONFIG_PREEMPT_DYNAMIC */ + + /* * This is 
the entry point to schedule() from kernel preemption * off of irq context.
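The NONE/VOLUNTARY/FULL table relies on two static-call properties that the quoted casts hint at, shown here as a hedged sketch (my_* names are illustrative): updating a void call to NULL patches the call site into a NOP, and __static_call_return0 turns an int-returning call into a bare "return 0". That is how preempt=none and preempt=voluntary neuter preempt_schedule*() and how preempt=full neuters cond_resched().

#include <linux/static_call.h>

static void my_hook(void)
{
	/* illustrative side effect */
}

static int my_check(void)
{
	return 1;
}

DEFINE_STATIC_CALL(my_hook_call, my_hook);
DEFINE_STATIC_CALL(my_check_call, my_check);

static void my_boot_tuning(void)
{
	/* void call -> NOP: the call instruction itself is patched away */
	static_call_update(my_hook_call, (typeof(&my_hook)) NULL);

	/* int call -> constant 0, no function body is executed */
	static_call_update(my_check_call, (typeof(&my_check)) __static_call_return0);

	static_call(my_hook_call)();	/* now does nothing */
	static_call(my_check_call)();	/* now yields 0 */
}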
[tip: sched/core] preempt/dynamic: Provide cond_resched() and might_resched() static calls
The following commit has been merged into the sched/core branch of tip: Commit-ID: b965f1ddb47daa5b8b2e2bc9c921431236830367 Gitweb: https://git.kernel.org/tip/b965f1ddb47daa5b8b2e2bc9c921431236830367 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:20 +01:00 Committer: Ingo Molnar CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00 preempt/dynamic: Provide cond_resched() and might_resched() static calls Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT) and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) to that we can override their behaviour when preempt= is overriden. Since the default behaviour is full preemption, both their calls are ignored when preempt= isn't passed. [fweisbec: branch might_resched() directly to __cond_resched(), only define static calls when PREEMPT_DYNAMIC] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20210118141223.123667-6-frede...@kernel.org --- include/linux/kernel.h | 23 +++ include/linux/sched.h | 27 --- kernel/sched/core.c| 16 +--- 3 files changed, 56 insertions(+), 10 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index f7902d8..cfd3d34 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -15,7 +15,7 @@ #include #include #include - +#include #include #include @@ -81,11 +81,26 @@ struct pt_regs; struct user; #ifdef CONFIG_PREEMPT_VOLUNTARY -extern int _cond_resched(void); -# define might_resched() _cond_resched() + +extern int __cond_resched(void); +# define might_resched() __cond_resched() + +#elif defined(CONFIG_PREEMPT_DYNAMIC) + +extern int __cond_resched(void); + +DECLARE_STATIC_CALL(might_resched, __cond_resched); + +static __always_inline void might_resched(void) +{ + static_call(might_resched)(); +} + #else + # define might_resched() do { } while (0) -#endif + +#endif /* CONFIG_PREEMPT_* */ #ifdef CONFIG_DEBUG_ATOMIC_SLEEP extern void ___might_sleep(const char *file, int line, int preempt_offset); diff --git a/include/linux/sched.h b/include/linux/sched.h index e115222..2f35594 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1871,11 +1871,32 @@ static inline int test_tsk_need_resched(struct task_struct *tsk) * value indicates whether a reschedule was done in fact. 
* cond_resched_lock() will drop the spinlock before scheduling, */ -#ifndef CONFIG_PREEMPTION -extern int _cond_resched(void); +#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) +extern int __cond_resched(void); + +#ifdef CONFIG_PREEMPT_DYNAMIC + +DECLARE_STATIC_CALL(cond_resched, __cond_resched); + +static __always_inline int _cond_resched(void) +{ + return static_call(cond_resched)(); +} + #else + +static inline int _cond_resched(void) +{ + return __cond_resched(); +} + +#endif /* CONFIG_PREEMPT_DYNAMIC */ + +#else + static inline int _cond_resched(void) { return 0; } -#endif + +#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */ #define cond_resched() ({ \ ___might_sleep(__FILE__, __LINE__, 0); \ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4afbdd2..f7c8fd8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6785,17 +6785,27 @@ SYSCALL_DEFINE0(sched_yield) return 0; } -#ifndef CONFIG_PREEMPTION -int __sched _cond_resched(void) +#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) +int __sched __cond_resched(void) { if (should_resched(0)) { preempt_schedule_common(); return 1; } +#ifndef CONFIG_PREEMPT_RCU rcu_all_qs(); +#endif return 0; } -EXPORT_SYMBOL(_cond_resched); +EXPORT_SYMBOL(__cond_resched); +#endif + +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched); +EXPORT_STATIC_CALL(cond_resched); + +DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched); +EXPORT_STATIC_CALL(might_resched); #endif /*
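From the caller's side nothing changes, which is the point; a hedged usage sketch (my_long_job() is illustrative):

#include <linux/list.h>
#include <linux/sched.h>

static void my_long_job(struct list_head *items)
{
	struct list_head *pos;

	list_for_each(pos, items) {
		/* ... process one item ... */

		/*
		 * Explicit preemption point. Under preempt=none or
		 * preempt=voluntary the static call routes to
		 * __cond_resched(); under preempt=full it is patched to
		 * return 0, since full preemption already reschedules
		 * wherever needed.
		 */
		cond_resched();
	}
}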
[tip: sched/core] preempt/dynamic: Support dynamic preempt with preempt= boot option
The following commit has been merged into the sched/core branch of tip: Commit-ID: 0e79823f55de3cff95894fbb40440b17910e7378 Gitweb: https://git.kernel.org/tip/0e79823f55de3cff95894fbb40440b17910e7378 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:23 +01:00 Committer: Peter Zijlstra CommitterDate: Tue, 09 Feb 2021 16:30:58 +01:00 preempt/dynamic: Support dynamic preempt with preempt= boot option Support the preempt= boot option and patch the static call sites accordingly. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20210118141223.123667-9-frede...@kernel.org --- kernel/sched/core.c | 68 +++- 1 file changed, 67 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index cd0c46f..220393d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5314,9 +5314,75 @@ DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func); EXPORT_STATIC_CALL(preempt_schedule_notrace); #endif - #endif /* CONFIG_PREEMPTION */ +#ifdef CONFIG_PREEMPT_DYNAMIC + +#include + +/* + * SC:cond_resched + * SC:might_resched + * SC:preempt_schedule + * SC:preempt_schedule_notrace + * SC:irqentry_exit_cond_resched + * + * + * NONE: + * cond_resched <- __cond_resched + * might_resched <- RET0 + * preempt_schedule <- NOP + * preempt_schedule_notrace <- NOP + * irqentry_exit_cond_resched <- NOP + * + * VOLUNTARY: + * cond_resched <- __cond_resched + * might_resched <- __cond_resched + * preempt_schedule <- NOP + * preempt_schedule_notrace <- NOP + * irqentry_exit_cond_resched <- NOP + * + * FULL: + * cond_resched <- RET0 + * might_resched <- RET0 + * preempt_schedule <- preempt_schedule + * preempt_schedule_notrace <- preempt_schedule_notrace + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched + */ +static int __init setup_preempt_mode(char *str) +{ + if (!strcmp(str, "none")) { + static_call_update(cond_resched, __cond_resched); + static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL); + static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL); + static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL); + pr_info("Dynamic Preempt: %s\n", str); + } else if (!strcmp(str, "voluntary")) { + static_call_update(cond_resched, __cond_resched); + static_call_update(might_resched, __cond_resched); + static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL); + static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL); + static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL); + pr_info("Dynamic Preempt: %s\n", str); + } else if (!strcmp(str, "full")) { + static_call_update(cond_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(preempt_schedule, __preempt_schedule_func); + static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func); + static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched); + pr_info("Dynamic Preempt: %s\n", str); + } else { + pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default to full\n", str); + return 1; + } + return 0; +} +__setup("preempt=", setup_preempt_mode); + +#endif /* CONFIG_PREEMPT_DYNAMIC */ + + /* * This is the entry point to 
schedule() from kernel preemption * off of irq context.
[tip: sched/core] preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
The following commit has been merged into the sched/core branch of tip: Commit-ID: 8c98e8cf723c3ab2ac924b0942dd3b8074f874e5 Gitweb: https://git.kernel.org/tip/8c98e8cf723c3ab2ac924b0942dd3b8074f874e5 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:21 +01:00 Committer: Peter Zijlstra CommitterDate: Fri, 05 Feb 2021 17:19:57 +01:00 preempt/dynamic: Provide preempt_schedule[_notrace]() static calls Provide static calls to control preempt_schedule[_notrace]() (called in CONFIG_PREEMPT) so that we can override their behaviour when preempt= is overriden. Since the default behaviour is full preemption, both their calls are initialized to the arch provided wrapper, if any. [fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less dependent on x86 with __preempt_schedule_func] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20210118141223.123667-7-frede...@kernel.org --- arch/x86/include/asm/preempt.h | 34 + kernel/sched/core.c| 12 - 2 files changed, 38 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h index 69485ca..9b12dce 100644 --- a/arch/x86/include/asm/preempt.h +++ b/arch/x86/include/asm/preempt.h @@ -5,6 +5,7 @@ #include #include #include +#include DECLARE_PER_CPU(int, __preempt_count); @@ -103,16 +104,33 @@ static __always_inline bool should_resched(int preempt_offset) } #ifdef CONFIG_PREEMPTION - extern asmlinkage void preempt_schedule_thunk(void); -# define __preempt_schedule() \ - asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT) - extern asmlinkage void preempt_schedule(void); - extern asmlinkage void preempt_schedule_notrace_thunk(void); -# define __preempt_schedule_notrace() \ - asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT) +extern asmlinkage void preempt_schedule(void); +extern asmlinkage void preempt_schedule_thunk(void); + +#define __preempt_schedule_func preempt_schedule_thunk + +DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func); + +#define __preempt_schedule() \ +do { \ + __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule)); \ + asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \ +} while (0) + +extern asmlinkage void preempt_schedule_notrace(void); +extern asmlinkage void preempt_schedule_notrace_thunk(void); + +#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk + +DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func); + +#define __preempt_schedule_notrace() \ +do { \ + __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule_notrace)); \ + asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : ASM_CALL_CONSTRAINT); \ +} while (0) - extern asmlinkage void preempt_schedule_notrace(void); #endif #endif /* __ASM_PREEMPT_H */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bc7b00b..cd0c46f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5251,6 +5251,12 @@ asmlinkage __visible void __sched notrace preempt_schedule(void) NOKPROBE_SYMBOL(preempt_schedule); EXPORT_SYMBOL(preempt_schedule); +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func); +EXPORT_STATIC_CALL(preempt_schedule); +#endif + + /** * preempt_schedule_notrace - preempt_schedule called by tracing * @@ -5303,6 +5309,12 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void) } 
EXPORT_SYMBOL_GPL(preempt_schedule_notrace); +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func); +EXPORT_STATIC_CALL(preempt_schedule_notrace); +#endif + + #endif /* CONFIG_PREEMPTION */ /*
[tip: sched/core] preempt/dynamic: Provide cond_resched() and might_resched() static calls
The following commit has been merged into the sched/core branch of tip: Commit-ID: bf3054bb801cf566e65e5f3d060435dbfa4a2f36 Gitweb: https://git.kernel.org/tip/bf3054bb801cf566e65e5f3d060435dbfa4a2f36 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:20 +01:00 Committer: Peter Zijlstra CommitterDate: Fri, 05 Feb 2021 17:19:56 +01:00 preempt/dynamic: Provide cond_resched() and might_resched() static calls Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT) and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) to that we can override their behaviour when preempt= is overriden. Since the default behaviour is full preemption, both their calls are ignored when preempt= isn't passed. [fweisbec: branch might_resched() directly to __cond_resched(), only define static calls when PREEMPT_DYNAMIC] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20210118141223.123667-6-frede...@kernel.org --- include/linux/kernel.h | 23 +++ include/linux/sched.h | 27 --- kernel/sched/core.c| 16 +--- 3 files changed, 56 insertions(+), 10 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index f7902d8..cfd3d34 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -15,7 +15,7 @@ #include #include #include - +#include #include #include @@ -81,11 +81,26 @@ struct pt_regs; struct user; #ifdef CONFIG_PREEMPT_VOLUNTARY -extern int _cond_resched(void); -# define might_resched() _cond_resched() + +extern int __cond_resched(void); +# define might_resched() __cond_resched() + +#elif defined(CONFIG_PREEMPT_DYNAMIC) + +extern int __cond_resched(void); + +DECLARE_STATIC_CALL(might_resched, __cond_resched); + +static __always_inline void might_resched(void) +{ + static_call(might_resched)(); +} + #else + # define might_resched() do { } while (0) -#endif + +#endif /* CONFIG_PREEMPT_* */ #ifdef CONFIG_DEBUG_ATOMIC_SLEEP extern void ___might_sleep(const char *file, int line, int preempt_offset); diff --git a/include/linux/sched.h b/include/linux/sched.h index e115222..2f35594 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1871,11 +1871,32 @@ static inline int test_tsk_need_resched(struct task_struct *tsk) * value indicates whether a reschedule was done in fact. 
* cond_resched_lock() will drop the spinlock before scheduling, */ -#ifndef CONFIG_PREEMPTION -extern int _cond_resched(void); +#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) +extern int __cond_resched(void); + +#ifdef CONFIG_PREEMPT_DYNAMIC + +DECLARE_STATIC_CALL(cond_resched, __cond_resched); + +static __always_inline int _cond_resched(void) +{ + return static_call(cond_resched)(); +} + #else + +static inline int _cond_resched(void) +{ + return __cond_resched(); +} + +#endif /* CONFIG_PREEMPT_DYNAMIC */ + +#else + static inline int _cond_resched(void) { return 0; } -#endif + +#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */ #define cond_resched() ({ \ ___might_sleep(__FILE__, __LINE__, 0); \ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index be3a956..bc7b00b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6771,17 +6771,27 @@ SYSCALL_DEFINE0(sched_yield) return 0; } -#ifndef CONFIG_PREEMPTION -int __sched _cond_resched(void) +#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) +int __sched __cond_resched(void) { if (should_resched(0)) { preempt_schedule_common(); return 1; } +#ifndef CONFIG_PREEMPT_RCU rcu_all_qs(); +#endif return 0; } -EXPORT_SYMBOL(_cond_resched); +EXPORT_SYMBOL(__cond_resched); +#endif + +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched); +EXPORT_STATIC_CALL(cond_resched); + +DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched); +EXPORT_STATIC_CALL(might_resched); #endif /*
[tip: sched/core] preempt/dynamic: Support dynamic preempt with preempt= boot option
The following commit has been merged into the sched/core branch of tip: Commit-ID: 4f8a0041162ab74ba5f89ee58f8aad59c4c85d22 Gitweb: https://git.kernel.org/tip/4f8a0041162ab74ba5f89ee58f8aad59c4c85d22 Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:23 +01:00 Committer: Peter Zijlstra CommitterDate: Fri, 05 Feb 2021 17:19:58 +01:00 preempt/dynamic: Support dynamic preempt with preempt= boot option Support the preempt= boot option and patch the static call sites accordingly. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20210118141223.123667-9-frede...@kernel.org --- kernel/sched/core.c | 67 +++- 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index cd0c46f..0562992 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -12,6 +12,7 @@ #include "sched.h" +#include #include #include @@ -5314,9 +5315,73 @@ DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func); EXPORT_STATIC_CALL(preempt_schedule_notrace); #endif - #endif /* CONFIG_PREEMPTION */ +#ifdef CONFIG_PREEMPT_DYNAMIC + +/* + * SC:cond_resched + * SC:might_resched + * SC:preempt_schedule + * SC:preempt_schedule_notrace + * SC:irqentry_exit_cond_resched + * + * + * NONE: + * cond_resched <- __cond_resched + * might_resched <- RET0 + * preempt_schedule <- NOP + * preempt_schedule_notrace <- NOP + * irqentry_exit_cond_resched <- NOP + * + * VOLUNTARY: + * cond_resched <- __cond_resched + * might_resched <- __cond_resched + * preempt_schedule <- NOP + * preempt_schedule_notrace <- NOP + * irqentry_exit_cond_resched <- NOP + * + * FULL: + * cond_resched <- RET0 + * might_resched <- RET0 + * preempt_schedule <- preempt_schedule + * preempt_schedule_notrace <- preempt_schedule_notrace + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched + */ +static int __init setup_preempt_mode(char *str) +{ + if (!strcmp(str, "none")) { + static_call_update(cond_resched, __cond_resched); + static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL); + static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL); + static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL); + pr_info("Dynamic Preempt: %s\n", str); + } else if (!strcmp(str, "voluntary")) { + static_call_update(cond_resched, __cond_resched); + static_call_update(might_resched, __cond_resched); + static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL); + static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL); + static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL); + pr_info("Dynamic Preempt: %s\n", str); + } else if (!strcmp(str, "full")) { + static_call_update(cond_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0); + static_call_update(preempt_schedule, __preempt_schedule_func); + static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func); + static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched); + pr_info("Dynamic Preempt: %s\n", str); + } else { + pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default to full\n", str); + return 1; + } + return 0; +} +__setup("preempt=", setup_preempt_mode); + +#endif /* 
CONFIG_PREEMPT_DYNAMIC */ + + /* * This is the entry point to schedule() from kernel preemption * off of irq context.
[tip: sched/core] preempt/dynamic: Provide irqentry_exit_cond_resched() static call
The following commit has been merged into the sched/core branch of tip: Commit-ID: 74345075999752a7a9c805fe5e2ec770345cd1ca Gitweb: https://git.kernel.org/tip/74345075999752a7a9c805fe5e2ec770345cd1ca Author:Peter Zijlstra (Intel) AuthorDate:Mon, 18 Jan 2021 15:12:22 +01:00 Committer: Peter Zijlstra CommitterDate: Fri, 05 Feb 2021 17:19:57 +01:00 preempt/dynamic: Provide irqentry_exit_cond_resched() static call Provide static call to control IRQ preemption (called in CONFIG_PREEMPT) so that we can override its behaviour when preempt= is overriden. Since the default behaviour is full preemption, its call is initialized to provide IRQ preemption when preempt= isn't passed. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20210118141223.123667-8-frede...@kernel.org --- include/linux/entry-common.h | 4 kernel/entry/common.c| 10 +- 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index ca86a00..1401c93 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -2,6 +2,7 @@ #ifndef __LINUX_ENTRYCOMMON_H #define __LINUX_ENTRYCOMMON_H +#include #include #include #include @@ -453,6 +454,9 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs); * Conditional reschedule with additional sanity checks. */ void irqentry_exit_cond_resched(void); +#ifdef CONFIG_PREEMPT_DYNAMIC +DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched); +#endif /** * irqentry_exit - Handle return from exception that used irqentry_enter() diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 3783416..84fa7ec 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -393,6 +393,9 @@ void irqentry_exit_cond_resched(void) preempt_schedule_irq(); } } +#ifdef CONFIG_PREEMPT_DYNAMIC +DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched); +#endif noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) { @@ -419,8 +422,13 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) } instrumentation_begin(); - if (IS_ENABLED(CONFIG_PREEMPTION)) + if (IS_ENABLED(CONFIG_PREEMPTION)) { +#ifdef CONFIG_PREEMT_DYNAMIC + static_call(irqentry_exit_cond_resched)(); +#else irqentry_exit_cond_resched(); +#endif + } /* Covers both tracing and lockdep */ trace_hardirqs_on(); instrumentation_end();
[tip:perf/urgent] perf/x86: Fix embarrasing typo
Commit-ID: ce5686d4ed12158599d2042a6c8659254ed263ce Gitweb: http://git.kernel.org/tip/ce5686d4ed12158599d2042a6c8659254ed263ce Author: Peter Zijlstra (Intel) AuthorDate: Wed, 29 Oct 2014 11:17:04 +0100 Committer: Ingo Molnar CommitDate: Tue, 4 Nov 2014 07:06:58 +0100 perf/x86: Fix embarrasing typo Because we're all human and typing sucks.. Fixes: 7fb0f1de49fc ("perf/x86: Fix compile warnings for intel_uncore") Reported-by: Andi Kleen Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: x...@kernel.org Link: http://lkml.kernel.org/n/tip-be0bftjh8yfm4uvmvtf3y...@git.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index ded8a67..41a503c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -144,7 +144,7 @@ config INSTRUCTION_DECODER config PERF_EVENTS_INTEL_UNCORE def_bool y - depends on PERF_EVENTS && SUP_SUP_INTEL && PCI + depends on PERF_EVENTS && CPU_SUP_INTEL && PCI config OUTPUT_FORMAT string
[tip:perf/urgent] perf: Fix bogus kernel printk
Commit-ID: 65d71fe1375b973083733294795bf2b09d45b3c2 Gitweb: http://git.kernel.org/tip/65d71fe1375b973083733294795bf2b09d45b3c2 Author: Peter Zijlstra (Intel) AuthorDate: Tue, 7 Oct 2014 19:07:33 +0200 Committer: Ingo Molnar CommitDate: Tue, 28 Oct 2014 10:51:01 +0100 perf: Fix bogus kernel printk Andy spotted the fail in what was intended as a conditional printk level. Reported-by: Andy Lutomirski Fixes: cc6cd47e7395 ("perf/x86: Tone down kernel messages when the PMU check fails in a virtual environment") Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Linus Torvalds Link: http://lkml.kernel.org/r/20141007124757.gh19...@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 1b8299d..66451a6 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -243,8 +243,9 @@ static bool check_hw_exists(void) msr_fail: printk(KERN_CONT "Broken PMU hardware detected, using software events only.\n"); - printk(boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR - "Failed to access perfctr msr (MSR %x is %Lx)\n", reg, val_new); + printk("%sFailed to access perfctr msr (MSR %x is %Lx)\n", + boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR, + reg, val_new); return false; }
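The one-line reason the old call was bogus, in case the flattened diff is hard to read: adjacent string literals concatenate before the conditional operator is applied, so the level macro and the format string only stuck together on one arm. A hedged illustration (my_report() is an illustrative wrapper, not kernel code):

#include <linux/printk.h>

/*
 * The old form,
 *	hyp ? KERN_INFO : KERN_ERR "Failed to access perfctr msr ..."
 * parses as
 *	hyp ? KERN_INFO : (KERN_ERR "Failed to access perfctr msr ...")
 * so on a hypervisor the format degenerates to the bare KERN_INFO prefix
 * and the MSR arguments are silently dropped. Passing the level through
 * "%s" keeps one format string for both cases:
 */
static void my_report(bool hyp, unsigned int reg, unsigned long long val)
{
	printk("%sFailed to access perfctr msr (MSR %x is %Lx)\n",
	       hyp ? KERN_INFO : KERN_ERR, reg, val);
}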
[tip:smp/hotplug] sched: Allow per-cpu kernel threads to run on online && !active
Commit-ID: 618d6e31623149c6203b46850e2e76ee0f29e577 Gitweb: http://git.kernel.org/tip/618d6e31623149c6203b46850e2e76ee0f29e577 Author: Peter Zijlstra (Intel) AuthorDate: Thu, 10 Mar 2016 12:54:08 +0100 Committer: Thomas Gleixner CommitDate: Thu, 5 May 2016 13:17:52 +0200 sched: Allow per-cpu kernel threads to run on online && !active In order to enable symmetric hotplug, we must mirror the online && !active state of cpu-down on the cpu-up side. However, to retain sanity, limit this state to per-cpu kthreads. Aside from the change to set_cpus_allowed_ptr(), which allow moving the per-cpu kthreads on, the other critical piece is the cpu selection for pinned tasks in select_task_rq(). This avoids dropping into select_fallback_rq(). select_fallback_rq() cannot be allowed to select !active cpus because its used to migrate user tasks away. And we do not want to move user tasks onto cpus that are in transition. Requested-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Tested-by: Thomas Gleixner Cc: Lai Jiangshan Cc: Jan H. Schönherr Cc: Oleg Nesterov Cc: r...@linutronix.de Link: http://lkml.kernel.org/r/20160301152303.gv6...@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner --- arch/powerpc/kernel/smp.c | 2 +- arch/s390/kernel/smp.c| 2 +- include/linux/cpumask.h | 6 ++ kernel/sched/core.c | 49 --- 4 files changed, 46 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 8cac1eb..55c924b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -565,7 +565,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle) smp_ops->give_timebase(); /* Wait until cpu puts itself in the online & active maps */ - while (!cpu_online(cpu) || !cpu_active(cpu)) + while (!cpu_online(cpu)) cpu_relax(); return 0; diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 40a6b4f..7b89a75 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -832,7 +832,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle) pcpu_attach_task(pcpu, tidle); pcpu_start_fn(pcpu, smp_start_secondary, NULL); /* Wait until cpu puts itself in the online & active maps */ - while (!cpu_online(cpu) || !cpu_active(cpu)) + while (!cpu_online(cpu)) cpu_relax(); return 0; } diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 40cee6b..e828cf6 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -743,12 +743,10 @@ set_cpu_present(unsigned int cpu, bool present) static inline void set_cpu_online(unsigned int cpu, bool online) { - if (online) { + if (online) cpumask_set_cpu(cpu, &__cpu_online_mask); - cpumask_set_cpu(cpu, &__cpu_active_mask); - } else { + else cpumask_clear_cpu(cpu, &__cpu_online_mask); - } } static inline void diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8b489fc..8bfd7d4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1082,13 +1082,21 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) static int __set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask, bool check) { + const struct cpumask *cpu_valid_mask = cpu_active_mask; + unsigned int dest_cpu; unsigned long flags; struct rq *rq; - unsigned int dest_cpu; int ret = 0; rq = task_rq_lock(p, &flags); + if (p->flags & PF_KTHREAD) { + /* +* Kernel threads are allowed on online && !active CPUs +*/ + cpu_valid_mask = cpu_online_mask; + } + /* * Must re-check here, to close a race against __kthread_bind(), * sched_setaffinity() is not guaranteed to observe 
the flag. @@ -1101,18 +1109,28 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, if (cpumask_equal(&p->cpus_allowed, new_mask)) goto out; - if (!cpumask_intersects(new_mask, cpu_active_mask)) { + if (!cpumask_intersects(new_mask, cpu_valid_mask)) { ret = -EINVAL; goto out; } do_set_cpus_allowed(p, new_mask); + if (p->flags & PF_KTHREAD) { + /* +* For kernel threads that do indeed end up on online && +* !active we want to ensure they are strict per-cpu threads. +*/ + WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) && + !cpumask_intersects(new_mask, cpu_active_mask) && + p->nr_cpus_allowed != 1); + } +
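A hedged sketch of the kind of task this carve-out is for (my_* names are illustrative): a kthread bound to one CPU with kthread_bind() has PF_KTHREAD set and a single-CPU affinity mask, so after this change __set_cpus_allowed_ptr() validates it against cpu_online_mask and it may run while its CPU is online but not yet active; user tasks remain restricted to cpu_active_mask.

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int my_percpu_thread(void *data)
{
	while (!kthread_should_stop()) {
		/* ... per-cpu work ... */
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

static struct task_struct *my_start_on(int cpu)
{
	struct task_struct *tsk;

	tsk = kthread_create(my_percpu_thread, NULL, "my_percpu/%d", cpu);
	if (IS_ERR(tsk))
		return tsk;

	kthread_bind(tsk, cpu);		/* strict one-CPU affinity */
	wake_up_process(tsk);
	return tsk;
}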
[tip:smp/hotplug] sched: Allow per-cpu kernel threads to run on online && !active
Commit-ID: e9d867a67fd03ccc07248ca4e9c2f74fed494d5b Gitweb: http://git.kernel.org/tip/e9d867a67fd03ccc07248ca4e9c2f74fed494d5b Author: Peter Zijlstra (Intel) AuthorDate: Thu, 10 Mar 2016 12:54:08 +0100 Committer: Thomas Gleixner CommitDate: Fri, 6 May 2016 14:58:22 +0200 sched: Allow per-cpu kernel threads to run on online && !active In order to enable symmetric hotplug, we must mirror the online && !active state of cpu-down on the cpu-up side. However, to retain sanity, limit this state to per-cpu kthreads. Aside from the change to set_cpus_allowed_ptr(), which allow moving the per-cpu kthreads on, the other critical piece is the cpu selection for pinned tasks in select_task_rq(). This avoids dropping into select_fallback_rq(). select_fallback_rq() cannot be allowed to select !active cpus because its used to migrate user tasks away. And we do not want to move user tasks onto cpus that are in transition. Requested-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Tested-by: Thomas Gleixner Cc: Lai Jiangshan Cc: Jan H. Schönherr Cc: Oleg Nesterov Cc: r...@linutronix.de Link: http://lkml.kernel.org/r/20160301152303.gv6...@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner --- arch/powerpc/kernel/smp.c | 2 +- arch/s390/kernel/smp.c| 2 +- include/linux/cpumask.h | 6 ++ kernel/sched/core.c | 49 --- 4 files changed, 46 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 8cac1eb..55c924b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -565,7 +565,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle) smp_ops->give_timebase(); /* Wait until cpu puts itself in the online & active maps */ - while (!cpu_online(cpu) || !cpu_active(cpu)) + while (!cpu_online(cpu)) cpu_relax(); return 0; diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 40a6b4f..7b89a75 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -832,7 +832,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle) pcpu_attach_task(pcpu, tidle); pcpu_start_fn(pcpu, smp_start_secondary, NULL); /* Wait until cpu puts itself in the online & active maps */ - while (!cpu_online(cpu) || !cpu_active(cpu)) + while (!cpu_online(cpu)) cpu_relax(); return 0; } diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 40cee6b..e828cf6 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -743,12 +743,10 @@ set_cpu_present(unsigned int cpu, bool present) static inline void set_cpu_online(unsigned int cpu, bool online) { - if (online) { + if (online) cpumask_set_cpu(cpu, &__cpu_online_mask); - cpumask_set_cpu(cpu, &__cpu_active_mask); - } else { + else cpumask_clear_cpu(cpu, &__cpu_online_mask); - } } static inline void diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8b489fc..8bfd7d4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1082,13 +1082,21 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) static int __set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask, bool check) { + const struct cpumask *cpu_valid_mask = cpu_active_mask; + unsigned int dest_cpu; unsigned long flags; struct rq *rq; - unsigned int dest_cpu; int ret = 0; rq = task_rq_lock(p, &flags); + if (p->flags & PF_KTHREAD) { + /* +* Kernel threads are allowed on online && !active CPUs +*/ + cpu_valid_mask = cpu_online_mask; + } + /* * Must re-check here, to close a race against __kthread_bind(), * sched_setaffinity() is not guaranteed to observe 
the flag. @@ -1101,18 +1109,28 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, if (cpumask_equal(&p->cpus_allowed, new_mask)) goto out; - if (!cpumask_intersects(new_mask, cpu_active_mask)) { + if (!cpumask_intersects(new_mask, cpu_valid_mask)) { ret = -EINVAL; goto out; } do_set_cpus_allowed(p, new_mask); + if (p->flags & PF_KTHREAD) { + /* +* For kernel threads that do indeed end up on online && +* !active we want to ensure they are strict per-cpu threads. +*/ + WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) && + !cpumask_intersects(new_mask, cpu_active_mask) && + p->nr_cpus_allowed != 1); + } +
[tip:perf/core] perf/core: Rename perf_event_read_{one,group}, perf_read_hw
Commit-ID: b15f495b4e9295cf21065d8569835a2f18cfe41b Gitweb: http://git.kernel.org/tip/b15f495b4e9295cf21065d8569835a2f18cfe41b Author: Peter Zijlstra (Intel) AuthorDate: Thu, 3 Sep 2015 20:07:47 -0700 Committer: Ingo Molnar CommitDate: Sun, 13 Sep 2015 11:27:26 +0200 perf/core: Rename perf_event_read_{one,group}, perf_read_hw In order to free up the perf_event_read_group() name: s/perf_event_read_\(one\|group\)/perf_read_\1/g s/perf_read_hw/__perf_read/g Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Michael Ellerman Cc: Peter Zijlstra Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Link: http://lkml.kernel.org/r/1441336073-22750-5-git-send-email-suka...@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 260bf8c..67b7dba 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3742,7 +3742,7 @@ static void put_event(struct perf_event *event) * see the comment there. * * 2) there is a lock-inversion with mmap_sem through -* perf_event_read_group(), which takes faults while +* perf_read_group(), which takes faults while * holding ctx->mutex, however this is called after * the last filedesc died, so there is no possibility * to trigger the AB-BA case. @@ -3837,7 +3837,7 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) } EXPORT_SYMBOL_GPL(perf_event_read_value); -static int perf_event_read_group(struct perf_event *event, +static int perf_read_group(struct perf_event *event, u64 read_format, char __user *buf) { struct perf_event *leader = event->group_leader, *sub; @@ -3885,7 +3885,7 @@ static int perf_event_read_group(struct perf_event *event, return ret; } -static int perf_event_read_one(struct perf_event *event, +static int perf_read_one(struct perf_event *event, u64 read_format, char __user *buf) { u64 enabled, running; @@ -3923,7 +3923,7 @@ static bool is_event_hup(struct perf_event *event) * Read the performance event - simple non blocking version for now */ static ssize_t -perf_read_hw(struct perf_event *event, char __user *buf, size_t count) +__perf_read(struct perf_event *event, char __user *buf, size_t count) { u64 read_format = event->attr.read_format; int ret; @@ -3941,9 +3941,9 @@ perf_read_hw(struct perf_event *event, char __user *buf, size_t count) WARN_ON_ONCE(event->ctx->parent_ctx); if (read_format & PERF_FORMAT_GROUP) - ret = perf_event_read_group(event, read_format, buf); + ret = perf_read_group(event, read_format, buf); else - ret = perf_event_read_one(event, read_format, buf); + ret = perf_read_one(event, read_format, buf); return ret; } @@ -3956,7 +3956,7 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) int ret; ctx = perf_event_ctx_lock(event); - ret = perf_read_hw(event, buf, count); + ret = __perf_read(event, buf, count); perf_event_ctx_unlock(event, ctx); return ret; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:sched/core] wait.[ch]: Introduce the simple waitqueue (swait) implementation
Commit-ID: 13b35686e8b934ff78f59cef0c65fa3a43f8eeaf Gitweb: http://git.kernel.org/tip/13b35686e8b934ff78f59cef0c65fa3a43f8eeaf Author: Peter Zijlstra (Intel) AuthorDate: Fri, 19 Feb 2016 09:46:37 +0100 Committer: Thomas Gleixner CommitDate: Thu, 25 Feb 2016 11:27:16 +0100 wait.[ch]: Introduce the simple waitqueue (swait) implementation The existing wait queue support has support for custom wake up call backs, wake flags, wake key (passed to call back) and exclusive flags that allow wakers to be tagged as exclusive, for limiting the number of wakers. In a lot of cases, none of these features are used, and hence we can benefit from a slimmed down version that lowers memory overhead and reduces runtime overhead. The concept originated from -rt, where waitqueues are a constant source of trouble, as we can't convert the head lock to a raw spinlock due to fancy and long lasting callbacks. With the removal of custom callbacks, we can use a raw lock for queue list manipulations, hence allowing the simple wait support to be used in -rt. [Patch is from PeterZ which is based on Thomas version. Commit message is written by Paul G. Daniel: - Fixed some compile issues - Added non-lazy implementation of swake_up_locked as suggested by Boqun Feng.] Originally-by: Thomas Gleixner Signed-off-by: Daniel Wagner Acked-by: Peter Zijlstra (Intel) Cc: linux-rt-us...@vger.kernel.org Cc: Boqun Feng Cc: Marcelo Tosatti Cc: Steven Rostedt Cc: Paul Gortmaker Cc: Paolo Bonzini Cc: "Paul E. McKenney" Link: http://lkml.kernel.org/r/1455871601-27484-2-git-send-email-w...@monom.org Signed-off-by: Thomas Gleixner --- include/linux/swait.h | 172 ++ kernel/sched/Makefile | 2 +- kernel/sched/swait.c | 123 3 files changed, 296 insertions(+), 1 deletion(-) diff --git a/include/linux/swait.h b/include/linux/swait.h new file mode 100644 index 000..c1f9c62 --- /dev/null +++ b/include/linux/swait.h @@ -0,0 +1,172 @@ +#ifndef _LINUX_SWAIT_H +#define _LINUX_SWAIT_H + +#include +#include +#include +#include + +/* + * Simple wait queues + * + * While these are very similar to the other/complex wait queues (wait.h) the + * most important difference is that the simple waitqueue allows for + * deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold + * times. + * + * In order to make this so, we had to drop a fair number of features of the + * other waitqueue code; notably: + * + * - mixing INTERRUPTIBLE and UNINTERRUPTIBLE sleeps on the same waitqueue; + *all wakeups are TASK_NORMAL in order to avoid O(n) lookups for the right + *sleeper state. + * + * - the exclusive mode; because this requires preserving the list order + *and this is hard. + * + * - custom wake functions; because you cannot give any guarantees about + *random code. + * + * As a side effect of this; the data structures are slimmer. + * + * One would recommend using this wait queue where possible. 
+ */ + +struct task_struct; + +struct swait_queue_head { + raw_spinlock_t lock; + struct list_headtask_list; +}; + +struct swait_queue { + struct task_struct *task; + struct list_headtask_list; +}; + +#define __SWAITQUEUE_INITIALIZER(name) { \ + .task = current, \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAITQUEUE(name) \ + struct swait_queue name = __SWAITQUEUE_INITIALIZER(name) + +#define __SWAIT_QUEUE_HEAD_INITIALIZER(name) { \ + .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAIT_QUEUE_HEAD(name) \ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INITIALIZER(name) + +extern void __init_swait_queue_head(struct swait_queue_head *q, const char *name, + struct lock_class_key *key); + +#define init_swait_queue_head(q) \ + do {\ + static struct lock_class_key __key; \ + __init_swait_queue_head((q), #q, &__key); \ + } while (0) + +#ifdef CONFIG_LOCKDEP +# define __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) \ + ({ init_swait_queue_head(&name); name; }) +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) +#else +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\ + DECLARE_SWAIT_QUEUE_HEAD(name) +#endif + +static inline int swait_active(struct swait_queue_hea
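A hedged usage sketch of the API this introduces (my_* names are illustrative; much later kernels rename swake_up() to swake_up_one()): the waiter sleeps uninterruptibly on a condition, the waker publishes the condition and wakes a single task, and because the head lock is a raw spinlock with bounded work this remains usable on -rt.

#include <linux/swait.h>

static DECLARE_SWAIT_QUEUE_HEAD(my_wq);
static bool my_done;

/* waiter: TASK_UNINTERRUPTIBLE sleep until my_done becomes true */
static void my_wait_for_done(void)
{
	swait_event(my_wq, my_done);
}

/* waker: set the condition, then wake exactly one waiter */
static void my_signal_done(void)
{
	my_done = true;
	swake_up(&my_wq);
}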
[tip:locking/core] locking/qspinlock: Add pending bit
Commit-ID: c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7 Gitweb: http://git.kernel.org/tip/c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7 Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:32 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:36:32 +0200 locking/qspinlock: Add pending bit Because the qspinlock needs to touch a second cacheline (the per-cpu mcs_nodes[]); add a pending bit and allow a single in-word spinner before we punt to the second cacheline. It is possible so observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- include/asm-generic/qspinlock_types.h | 12 +++- kernel/locking/qspinlock.c| 119 -- 2 files changed, 107 insertions(+), 24 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index aec05c7..7ee6632 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -36,8 +36,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U << _Q_ ## type ## _BITS) - 1)\ << _Q_ ## type ## _OFFSET) @@ -45,7 +46,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -54,5 +59,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 029b51c..af9c2ef 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(&mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queued_spin_lock_slowpath - acquire the queued spinlock * @lock: Pointer to queued spinlock structure * @val: Current value of the queued spinlock 32-bit word * - * (queue tail, lock value) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:--> (0,1) :--> (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+--> (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+--> 
(*,0) -> (*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock value) * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:--> (0,0,1) --:--> (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +--> (0,1,0) \ | : + * :
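To make the new word layout concrete, a small hedged helper (illustrative, not part of the patch) that decodes a lock word using the masks added above; the payoff of the pending bit is that a single in-word spinner waits through (0,1,1) -> (0,1,0) and then takes the lock, without ever touching the second cacheline that holds the MCS nodes.

#include <linux/kernel.h>
#include <asm-generic/qspinlock_types.h>

/* Illustrative decode of the 32-bit lock word: (tail cpu+1, tail idx, pending, locked). */
static void my_decode_lockword(u32 val)
{
	u32 locked  = val & _Q_LOCKED_MASK;
	u32 pending = (val & _Q_PENDING_MASK) >> _Q_PENDING_OFFSET;
	u32 idx     = (val & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
	u32 cpu1    = (val & _Q_TAIL_CPU_MASK) >> _Q_TAIL_CPU_OFFSET;

	pr_debug("locked=%u pending=%u tail idx=%u tail cpu=%d\n",
		 locked, pending, idx, (int)cpu1 - 1);
}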
[tip:locking/core] locking/qspinlock: Optimize for smaller NR_CPUS
Commit-ID: 69f9cae90907e09af95fb991ed384670cef8dd32 Gitweb: http://git.kernel.org/tip/69f9cae90907e09af95fb991ed384670cef8dd32 Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:34 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:36:48 +0200 locking/qspinlock: Optimize for smaller NR_CPUS When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- include/asm-generic/qspinlock_types.h | 13 +++ kernel/locking/qspinlock.c| 69 ++- 2 files changed, 81 insertions(+), 1 deletion(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 3a7f671..85f888e 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -35,6 +35,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS < 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS >= 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -47,7 +55,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS < (1U << 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -58,6 +70,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 82bb4a9..e17efe7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -24,6 +24,7 @@ #include #include #include +#include #include /* @@ -56,6 +57,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. 
+ * */ #include "mcs_spinlock.h" @@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queued spinlock structure + * + * *,1,0 -> *,0,1 + * + * Lock stealing is not allowed if this function is used. + */ +static __always_inline void clear_pending_set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL); +} + +/* + * xchg_ta
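The quoted hunk is cut off right where xchg_tail() begins; for the _Q_PENDING_BITS == 8 layout the idea is, roughly, a single 16-bit exchange of the tail halfword. Treat the following as a hedged sketch of that idea rather than the verbatim patch text:

/*
 * Sketch: with an 8-bit pending field the tail occupies its own 16 bits,
 * so publishing a new queue tail is one xchg() on l->tail instead of a
 * cmpxchg() loop over the whole 32-bit word.
 */
static __always_inline u32 my_xchg_tail(struct qspinlock *lock, u32 tail)
{
	struct __qspinlock *l = (void *)lock;

	/* returns the previous tail, shifted back into lock-word position */
	return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}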
[tip:locking/core] locking/qspinlock: Revert to test-and-set on hypervisors
Commit-ID: 2aa79af64263190eec610422b07f60e99a7d230a Gitweb: http://git.kernel.org/tip/2aa79af64263190eec610422b07f60e99a7d230a Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:36 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:36:58 +0200 locking/qspinlock: Revert to test-and-set on hypervisors When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h | 7 +++ kernel/locking/qspinlock.c | 3 +++ 3 files changed, 24 insertions(+) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index e2aee82..f079b70 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include #include #definequeued_spin_unlock queued_spin_unlock @@ -15,6 +16,19 @@ static inline void queued_spin_unlock(struct qspinlock *lock) smp_store_release((u8 *)lock, 0); } +#define virt_queued_spin_lock virt_queued_spin_lock + +static inline bool virt_queued_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include #endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index 569abcd..83bfb87 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -111,6 +111,13 @@ static inline void queued_spin_unlock_wait(struct qspinlock *lock) cpu_relax(); } +#ifndef virt_queued_spin_lock +static __always_inline bool virt_queued_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0338721..fd31a47 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -249,6 +249,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); + if (virt_queued_spin_lock(lock)) + return; + /* * wait for in-progress pending->locked hand-overs *
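For other architectures the hook is opt-in: define virt_queued_spin_lock() before including the generic header, otherwise the asm-generic stub above (which simply returns false) is used and the queue is kept. A hedged sketch of what a hypothetical port would add; running_on_hypervisor() is an invented placeholder, the rest mirrors the x86 hunk:

/* Hypothetical arch/foo/include/asm/qspinlock.h fragment (sketch only). */
#define virt_queued_spin_lock virt_queued_spin_lock

static inline bool virt_queued_spin_lock(struct qspinlock *lock)
{
        if (!running_on_hypervisor())   /* invented helper for the sketch */
                return false;

        /*
         * Plain test-and-set: no MCS queue, so a preempted lock holder
         * only delays the CPUs spinning here, not an ordered queue of
         * waiters that cannot make progress until it runs again.
         */
        while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
                cpu_relax();

        return true;
}

#include <asm-generic/qspinlock.h>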
[tip:locking/core] locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
Commit-ID: f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9 Gitweb: http://git.kernel.org/tip/f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9 Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:38 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:37:09 +0200 locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching We use the regular paravirt call patching to switch between: native_queued_spin_lock_slowpath()__pv_queued_spin_lock_slowpath() native_queued_spin_unlock() __pv_queued_spin_unlock() We use a callee saved call for the unlock function which reduces the i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions again. We further optimize the unlock path by patching the direct call with a "movb $0,%arg1" if we are indeed using the native unlock code. This makes the unlock code almost as fast as the !PARAVIRT case. This significantly lowers the overhead of having CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-10-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- arch/x86/Kconfig | 2 +- arch/x86/include/asm/paravirt.h | 29 - arch/x86/include/asm/paravirt_types.h | 10 ++ arch/x86/include/asm/qspinlock.h | 25 - arch/x86/include/asm/qspinlock_paravirt.h | 6 ++ arch/x86/kernel/paravirt-spinlocks.c | 24 +++- arch/x86/kernel/paravirt_patch_32.c | 22 ++ arch/x86/kernel/paravirt_patch_64.c | 22 ++ 8 files changed, 128 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 90b1b54..50ec043 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -667,7 +667,7 @@ config PARAVIRT_DEBUG config PARAVIRT_SPINLOCKS bool "Paravirtualization layer for spinlocks" depends on PARAVIRT && SMP - select UNINLINE_SPIN_UNLOCK + select UNINLINE_SPIN_UNLOCK if !QUEUED_SPINLOCK ---help--- Paravirtualized spinlocks allow a pvops backend to replace the spinlock implementation with something virtualization-friendly diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 8957810..266c353 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,6 +712,31 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUED_SPINLOCK + +static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock, + u32 val) +{ + PVOP_VCALL2(pv_lock_ops.queued_spin_lock_slowpath, lock, val); +} + +static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock) +{ + PVOP_VCALLEE1(pv_lock_ops.queued_spin_unlock, lock); +} + +static __always_inline void pv_wait(u8 *ptr, u8 val) +{ + PVOP_VCALL2(pv_lock_ops.wait, ptr, val); +} + +static __always_inline void pv_kick(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick, cpu); +} + +#else /* !CONFIG_QUEUED_SPINLOCK */ + static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -724,7 +749,9 @@ static __always_inline void 
__ticket_unlock_kick(struct arch_spinlock *lock, PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } -#endif +#endif /* CONFIG_QUEUED_SPINLOCK */ + +#endif /* SMP && PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;" diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index f7b0b5c..76cd684 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,19 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +struct qspinlock; + struct pv_lock_ops { +#ifdef CONFIG_QUEUED_SPINLOCK + void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val); + struct paravirt_callee_save queued_spin_unlock; + + void (*wait)(u8 *ptr, u8 val); + void (*kick)(int cpu); +#else /* !CONFIG_QUEUED_SPINLOCK */ struct paravirt_callee_save lock_spinning; v
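To see why patching the unlock call site down to a single "movb $0,%arg1" is safe, note that the native unlock is nothing more than a release store of zero to the locked byte (see the x86 queued_spin_unlock() quoted two patches back). A small hedged userspace analogue, little endian assumed and all names invented:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

union qspinlock_sketch {
        _Atomic uint32_t val;
        _Atomic uint8_t  locked;        /* low byte on a little-endian machine */
};

/* Equivalent in spirit to the native unlock: one byte store with release
 * semantics. On x86-64 an optimizing compiler turns this into essentially
 * "movb $0, (%rdi)", the instruction the changelog says gets patched in
 * over the call when the native unlock is in use. */
static void queued_spin_unlock_sketch(union qspinlock_sketch *lock)
{
        atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

int main(void)
{
        union qspinlock_sketch lock;

        atomic_init(&lock.val, 1);      /* locked, no pending, no tail */
        queued_spin_unlock_sketch(&lock);
        printf("lock word after unlock = %u\n",
               (unsigned int)atomic_load(&lock.val));
        return 0;
}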
[tip:perf/core] perf: Fix move_group() order
Commit-ID: 8f95b435b62522aed3381aaea920de8d09ccabf3 Gitweb: http://git.kernel.org/tip/8f95b435b62522aed3381aaea920de8d09ccabf3 Author: Peter Zijlstra (Intel) AuthorDate: Tue, 27 Jan 2015 11:53:12 +0100 Committer: Ingo Molnar CommitDate: Wed, 4 Feb 2015 08:07:11 +0100 perf: Fix move_group() order Jiri reported triggering the new WARN_ON_ONCE in event_sched_out over the weekend: event_sched_out.isra.79+0x2b9/0x2d0 group_sched_out+0x69/0xc0 ctx_sched_out+0x106/0x130 task_ctx_sched_out+0x37/0x70 __perf_install_in_context+0x70/0x1a0 remote_function+0x48/0x60 generic_exec_single+0x15b/0x1d0 smp_call_function_single+0x67/0xa0 task_function_call+0x53/0x80 perf_install_in_context+0x8b/0x110 I think the below should cure this; if we install a group leader it will iterate the (still intact) group list and find its siblings and try and install those too -- even though those still have the old event->ctx -- in the new ctx. Upon installing the first group sibling we'd try and schedule out the group and trigger the above warn. Fix this by installing the group leader last, installing siblings would have no effect, they're not reachable through the group lists and therefore we don't schedule them. Also delay resetting the state until we're absolutely sure the events are quiescent. Reported-by: Jiri Olsa Reported-by: vincent.wea...@maine.edu Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Linus Torvalds Link: http://lkml.kernel.org/r/20150126162639.ga21...@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/events/core.c | 56 +++- 1 file changed, 47 insertions(+), 9 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 417a96b..142dbabc 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7645,16 +7645,9 @@ SYSCALL_DEFINE5(perf_event_open, perf_remove_from_context(group_leader, false); - /* -* Removing from the context ends up with disabled -* event. What we want here is event in the initial -* startup state, ready to be add into new context. -*/ - perf_event__state_init(group_leader); list_for_each_entry(sibling, &group_leader->sibling_list, group_entry) { perf_remove_from_context(sibling, false); - perf_event__state_init(sibling); put_ctx(gctx); } } else { @@ -7670,13 +7663,31 @@ SYSCALL_DEFINE5(perf_event_open, */ synchronize_rcu(); - perf_install_in_context(ctx, group_leader, group_leader->cpu); - get_ctx(ctx); + /* +* Install the group siblings before the group leader. +* +* Because a group leader will try and install the entire group +* (through the sibling list, which is still in-tact), we can +* end up with siblings installed in the wrong context. +* +* By installing siblings first we NO-OP because they're not +* reachable through the group lists. +*/ list_for_each_entry(sibling, &group_leader->sibling_list, group_entry) { + perf_event__state_init(sibling); perf_install_in_context(ctx, sibling, sibling->cpu); get_ctx(ctx); } + + /* +* Removing from the context ends up with disabled +* event. What we want here is event in the initial +* startup state, ready to be add into new context. +*/ + perf_event__state_init(group_leader); + perf_install_in_context(ctx, group_leader, group_leader->cpu); + get_ctx(ctx); } perf_install_in_context(ctx, event, event->cpu); @@ -7806,8 +7817,35 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu) list_add(&event->migrate_entry, &events); } + /* +* Wait for the events to quiesce before re-instating them. +*/ synchronize_rcu(); + /* +* Re-instate events in 2 passes. 
+* +* Skip over group leaders and only install siblings on this first +* pass, siblings will not get enabled without a leader, however a +* leader will enable its siblings, even if those are still on the old +* context. +*/ + list_for_each_entry_safe(event, tmp, &events, migrate_entry) { + if (event->group_leader == event) + continue; + +
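The final hunk is cut off above, but its shape is fully described by the new comment: two passes over the migration list. A hedged schematic of that order, with the state and accounting bookkeeping omitted and dst_ctx/dst_cpu named after the surrounding perf_pmu_migrate_context(); this is not the exact patch text:

/* Schematic only -- not the exact patch text. */

/* Pass 1: siblings. Installing them is a scheduling no-op because they
 * are not reachable through the group lists until their leader goes in. */
list_for_each_entry_safe(event, tmp, &events, migrate_entry) {
        if (event->group_leader == event)
                continue;               /* leaders wait for pass 2 */
        list_del(&event->migrate_entry);
        perf_install_in_context(dst_ctx, event, dst_cpu);
        get_ctx(dst_ctx);
}

/* Pass 2: group leaders last, so enabling a leader only pulls in siblings
 * that are already sitting in the destination context. */
list_for_each_entry_safe(event, tmp, &events, migrate_entry) {
        list_del(&event->migrate_entry);
        perf_install_in_context(dst_ctx, event, dst_cpu);
        get_ctx(dst_ctx);
}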
[tip:x86/boot] x86/kaslr, ACPI/NUMA: Fix KASLR build error
Commit-ID: 3a387c6d96e69f1710a3804eb68e1253263298f2 Gitweb: https://git.kernel.org/tip/3a387c6d96e69f1710a3804eb68e1253263298f2 Author: Peter Zijlstra (Intel) AuthorDate: Wed, 3 Oct 2018 14:41:27 +0200 Committer: Borislav Petkov CommitDate: Wed, 3 Oct 2018 16:15:49 +0200 x86/kaslr, ACPI/NUMA: Fix KASLR build error There is no point in trying to compile KASLR-specific code when there is no KASLR. [ bp: Move the whole crap into kaslr.c and make rand_mem_physical_padding static. ] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Borislav Petkov Cc: Cc: Cc: Cc: Cc: Cc: Link: http://lkml.kernel.org/r/20181003123402.ga15...@hirez.programming.kicks-ass.net --- arch/x86/include/asm/kaslr.h | 2 ++ arch/x86/include/asm/setup.h | 2 -- arch/x86/mm/kaslr.c | 19 ++- drivers/acpi/numa.c | 15 +++ 4 files changed, 23 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h index db7ba2feb947..95ef3fc01d12 100644 --- a/arch/x86/include/asm/kaslr.h +++ b/arch/x86/include/asm/kaslr.h @@ -6,8 +6,10 @@ unsigned long kaslr_get_random_long(const char *purpose); #ifdef CONFIG_RANDOMIZE_MEMORY void kernel_randomize_memory(void); +void kaslr_check_padding(void); #else static inline void kernel_randomize_memory(void) { } +static inline void kaslr_check_padding(void) { } #endif /* CONFIG_RANDOMIZE_MEMORY */ #endif diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h index 65a5bf8f6aba..ae13bc974416 100644 --- a/arch/x86/include/asm/setup.h +++ b/arch/x86/include/asm/setup.h @@ -80,8 +80,6 @@ static inline unsigned long kaslr_offset(void) return (unsigned long)&_text - __START_KERNEL; } -extern int rand_mem_physical_padding; - /* * Do NOT EVER look at the BIOS memory size location. * It does not work on many machines. diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c index 00cf4cae38f5..b3471388288d 100644 --- a/arch/x86/mm/kaslr.c +++ b/arch/x86/mm/kaslr.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -40,7 +41,7 @@ */ static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE; -int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING; +static int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING; /* * Memory regions randomized by KASLR (except modules that use a separate logic * earlier during boot). The list is ordered based on virtual addresses. This @@ -70,6 +71,22 @@ static inline bool kaslr_memory_enabled(void) return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN); } +/* + * Check the padding size for KASLR is enough. 
+ */ +void __init kaslr_check_padding(void) +{ + u64 max_possible_phys, max_actual_phys, threshold; + + max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40); + max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40); + threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40); + + if (max_possible_phys > threshold) + pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n", + (max_possible_phys - max_actual_phys) >> 40); +} + static int __init rand_mem_physical_padding_setup(char *str) { int max_padding = (1 << (MAX_PHYSMEM_BITS - TB_SHIFT)) - 1; diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c index 3d69834c692f..4408e37600ef 100644 --- a/drivers/acpi/numa.c +++ b/drivers/acpi/numa.c @@ -32,7 +32,7 @@ #include #include #include -#include +#include static nodemask_t nodes_found_map = NODE_MASK_NONE; @@ -436,7 +436,6 @@ acpi_table_parse_srat(enum acpi_srat_type id, int __init acpi_numa_init(void) { int cnt = 0; - u64 max_possible_phys, max_actual_phys, threshold; if (acpi_disabled) return -EINVAL; @@ -466,17 +465,9 @@ int __init acpi_numa_init(void) cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY, acpi_parse_memory_affinity, 0); - /* check the padding size for KASLR is enough. */ - if (parsed_numa_memblks && kaslr_enabled()) { - max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40); - max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40); - threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40); + if (parsed_numa_memblks) + kaslr_check_padding(); - if (max_possible_phys > threshold) { - pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n", - (max_possible_phys -
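The numa.c hunk is truncated above, but the check that moves into kaslr_check_padding() is plain terabyte arithmetic: round both the boot-time maximum and the hot-addable maximum up to whole TiB and suggest the difference as the padding. A small standalone example of the same computation (the sample sizes are made up):

#include <inttypes.h>
#include <stdio.h>

#define TB_SHIFT 40

static uint64_t roundup_tb(uint64_t x)
{
        return (x + (1ULL << TB_SHIFT) - 1) & ~((1ULL << TB_SHIFT) - 1);
}

int main(void)
{
        /* made-up example: 1.5 TiB present at boot, SRAT allows up to 4 TiB */
        uint64_t max_actual_phys   = roundup_tb(3ULL << 39);    /* -> 2 TiB */
        uint64_t max_possible_phys = roundup_tb(4ULL << 40);    /* -> 4 TiB */
        uint64_t padding_tb = 0;                /* no extra padding configured */
        uint64_t threshold = max_actual_phys + (padding_tb << TB_SHIFT);

        if (max_possible_phys > threshold)
                printf("Set 'rand_mem_physical_padding=%" PRIu64 "' to avoid memory hotadd failure.\n",
                       (max_possible_phys - max_actual_phys) >> TB_SHIFT);
        return 0;
}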
[tip:perf/core] perf annotate: Simplify header dotted line sizing
Commit-ID: 53dd9b5f95dda95bcadda1b4680be42dfe1f9e5e Gitweb: http://git.kernel.org/tip/53dd9b5f95dda95bcadda1b4680be42dfe1f9e5e Author: Peter Zijlstra (Intel) AuthorDate: Thu, 30 Jun 2016 09:17:26 -0300 Committer: Arnaldo Carvalho de Melo CommitDate: Thu, 30 Jun 2016 09:21:03 -0300 perf annotate: Simplify header dotted line sizing No need to use strlen, etc to figure that out, just use the return from printf(), it will tell how wide the following line needs to be. Signed-off-by: Peter Zijlstra (Intel) Cc: Jiri Olsa Link: http://lkml.kernel.org/r/20160630082955.ga30...@twins.programming.kicks-ass.net [ split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/util/annotate.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index c385fec..78e5d6f 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -1528,7 +1528,7 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map, int more = 0; u64 len; int width = 8; - int namelen, evsel_name_len, graph_dotted_len; + int graph_dotted_len; filename = strdup(dso->long_name); if (!filename) @@ -1540,17 +1540,14 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map, d_filename = basename(filename); len = symbol__size(sym); - namelen = strlen(d_filename); - evsel_name_len = strlen(evsel_name); if (perf_evsel__is_group_event(evsel)) width *= evsel->nr_members; - printf(" %-*.*s|Source code & Disassembly of %s for %s\n", + graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s for %s\n", width, width, "Percent", d_filename, evsel_name); - graph_dotted_len = width + namelen + evsel_name_len; - printf("-%-*.*s-\n", + printf("%-*.*s\n", graph_dotted_len, graph_dotted_len, graph_dotted_line); if (verbose)
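The same printf() trick works anywhere a header needs an underline of matching width, since printf() returns the number of characters it wrote. A minimal standalone example (not perf code):

#include <stdio.h>

/* 80 dashes: comfortably longer than any header printed here */
static const char dotted[] =
        "--------------------------------------------------------------------------------";

int main(void)
{
        const char *file = "vmlinux", *ev = "cycles:u";

        /* printf() returns the number of characters written... */
        int len = printf(" %-*.*s| Source code & Disassembly of %s for %s\n",
                         8, 8, "Percent", file, ev) - 1;   /* minus the '\n' */

        /* ...which is exactly the width the dotted underline needs */
        printf("%-*.*s\n", len, len, dotted);
        return 0;
}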
[tip:perf/core] perf annotate: Add number of samples to the header
Commit-ID: 135cce1bf12bd30d7d66360022f9dac6ea3a07cd Gitweb: http://git.kernel.org/tip/135cce1bf12bd30d7d66360022f9dac6ea3a07cd Author: Peter Zijlstra (Intel) AuthorDate: Thu, 30 Jun 2016 10:29:55 +0200 Committer: Arnaldo Carvalho de Melo CommitDate: Thu, 30 Jun 2016 18:27:42 -0300 perf annotate: Add number of samples to the header Staring at annotations of large functions is useless if there's only a few samples in them. Report the number of samples in the header to make this easier to determine. Committer note: The change amounts to: - Percent | Source code & Disassembly of perf-vdso.so for cycles:u -- + Percent | Source code & Disassembly of perf-vdso.so for cycles:u (3278 samples) + Signed-off-by: Peter Zijlstra (Intel) Cc: Jiri Olsa Link: http://lkml.kernel.org/r/20160630082955.ga30...@twins.programming.kicks-ass.net [ split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/util/annotate.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index 78e5d6f..e9825fe 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -1522,6 +1522,7 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map, const char *d_filename; const char *evsel_name = perf_evsel__name(evsel); struct annotation *notes = symbol__annotation(sym); + struct sym_hist *h = annotation__histogram(notes, evsel->idx); struct disasm_line *pos, *queue = NULL; u64 start = map__rip_2objdump(map, sym->start); int printed = 2, queue_len = 0; @@ -1544,8 +1545,8 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map, if (perf_evsel__is_group_event(evsel)) width *= evsel->nr_members; - graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s for %s\n", - width, width, "Percent", d_filename, evsel_name); + graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s for %s (%" PRIu64 " samples)\n", + width, width, "Percent", d_filename, evsel_name, h->sum); printf("%-*.*s\n", graph_dotted_len, graph_dotted_len, graph_dotted_line);
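The committer note above shows the resulting header. Since struct sym_hist::sum is a u64, the format string uses PRIu64 so it stays correct on 32-bit and 64-bit hosts alike; a tiny hedged example of that idiom, with the 3278 figure taken from the note:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        uint64_t nr_samples = 3278;     /* e.g. struct sym_hist::sum */

        /* PRIu64 keeps the u64 conversion correct on 32- and 64-bit hosts */
        printf(" Percent | Source code & Disassembly of perf-vdso.so for cycles:u"
               " (%" PRIu64 " samples)\n", nr_samples);
        return 0;
}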
[tip: sched/core] sched: Clean up scheduler_ipi()
The following commit has been merged into the sched/core branch of tip: Commit-ID: 90b5363acd4739769c3f38c1aff16171bd133e8c Gitweb: https://git.kernel.org/tip/90b5363acd4739769c3f38c1aff16171bd133e8c Author:Peter Zijlstra (Intel) AuthorDate:Fri, 27 Mar 2020 11:44:56 +01:00 Committer: Thomas Gleixner CommitterDate: Tue, 12 May 2020 17:10:48 +02:00 sched: Clean up scheduler_ipi() The scheduler IPI has grown weird and wonderful over the years, time for spring cleaning. Move all the non-trivial stuff out of it and into a regular smp function call IPI. This then reduces the schedule_ipi() to most of it's former NOP glory and ensures to keep the interrupt vector lean and mean. Aside of that avoiding the full irq_enter() in the x86 IPI implementation is incorrect as scheduler_ipi() can be instrumented. To work around that scheduler_ipi() had an irq_enter/exit() hack when heavy work was pending. This is gone now. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Link: https://lkml.kernel.org/r/20200505134058.361859...@linutronix.de --- kernel/sched/core.c | 64 --- kernel/sched/fair.c | 5 +-- kernel/sched/sched.h | 6 ++-- 3 files changed, 36 insertions(+), 39 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b58efb1..cd2070d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -219,6 +219,13 @@ void update_rq_clock(struct rq *rq) update_rq_clock_task(rq, delta); } +static inline void +rq_csd_init(struct rq *rq, call_single_data_t *csd, smp_call_func_t func) +{ + csd->flags = 0; + csd->func = func; + csd->info = rq; +} #ifdef CONFIG_SCHED_HRTICK /* @@ -314,16 +321,14 @@ void hrtick_start(struct rq *rq, u64 delay) hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL_PINNED_HARD); } + #endif /* CONFIG_SMP */ static void hrtick_rq_init(struct rq *rq) { #ifdef CONFIG_SMP - rq->hrtick_csd.flags = 0; - rq->hrtick_csd.func = __hrtick_start; - rq->hrtick_csd.info = rq; + rq_csd_init(rq, &rq->hrtick_csd, __hrtick_start); #endif - hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD); rq->hrtick_timer.function = hrtick; } @@ -650,6 +655,16 @@ static inline bool got_nohz_idle_kick(void) return false; } +static void nohz_csd_func(void *info) +{ + struct rq *rq = info; + + if (got_nohz_idle_kick()) { + rq->idle_balance = 1; + raise_softirq_irqoff(SCHED_SOFTIRQ); + } +} + #else /* CONFIG_NO_HZ_COMMON */ static inline bool got_nohz_idle_kick(void) @@ -2292,6 +2307,11 @@ void sched_ttwu_pending(void) rq_unlock_irqrestore(rq, &rf); } +static void wake_csd_func(void *info) +{ + sched_ttwu_pending(); +} + void scheduler_ipi(void) { /* @@ -2300,34 +2320,6 @@ void scheduler_ipi(void) * this IPI. */ preempt_fold_need_resched(); - - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()) - return; - - /* -* Not all reschedule IPI handlers call irq_enter/irq_exit, since -* traditionally all their work was done from the interrupt return -* path. Now that we actually do some work, we need to make sure -* we do call them. -* -* Some archs already do call them, luckily irq_enter/exit nest -* properly. -* -* Arguably we should visit all archs and update all handlers, -* however a fair share of IPIs are still resched only so this would -* somewhat pessimize the simple resched case. -*/ - irq_enter(); - sched_ttwu_pending(); - - /* -* Check if someone kicked us for doing the nohz idle load balance. 
-*/ - if (unlikely(got_nohz_idle_kick())) { - this_rq()->idle_balance = 1; - raise_softirq_irqoff(SCHED_SOFTIRQ); - } - irq_exit(); } static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags) @@ -2336,9 +2328,9 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags) p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED); - if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) { + if (llist_add(&p->wake_entry, &rq->wake_list)) { if (!set_nr_if_polling(rq->idle)) - smp_send_reschedule(cpu); + smp_call_function_single_async(cpu, &rq->wake_csd); else trace_sched_wake_idle_without_ipi(cpu); } @@ -6693,12 +6685,16 @@ void __init sched_init(void) rq->avg_idle = 2*sysctl_sched_migration_cost;
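The quoted diff stops partway through sched_init(), but the pattern it switches to is already visible: initialise a per-runqueue call_single_data once, then fire it with smp_call_function_single_async() instead of doing the work from the bare reschedule IPI. A hedged sketch of that pattern in isolation; the my_* names and do_pending_work() are illustrative, not from the patch:

/* Hedged sketch of the csd pattern above. */
struct my_rq {
        call_single_data_t wake_csd;
        /* ... */
};

/* Runs in IPI context on the target CPU, like wake_csd_func() above. */
static void my_csd_func(void *info)
{
        struct my_rq *rq = info;

        do_pending_work(rq);            /* invented placeholder */
}

static void my_rq_init(struct my_rq *rq)
{
        rq->wake_csd.flags = 0;
        rq->wake_csd.func  = my_csd_func;
        rq->wake_csd.info  = rq;
}

/* Replaces a bare smp_send_reschedule(cpu): the target CPU runs
 * my_csd_func() from the smp-function-call IPI instead. */
static void my_kick(int cpu, struct my_rq *rq)
{
        smp_call_function_single_async(cpu, &rq->wake_csd);
}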
[tip:smp/hotplug] jump_label: Pull get_online_cpus() into generic code
Commit-ID: 82947f31231157d8ab70fa8961f23fd3887a3327 Gitweb: http://git.kernel.org/tip/82947f31231157d8ab70fa8961f23fd3887a3327 Author: Peter Zijlstra (Intel) AuthorDate: Tue, 18 Apr 2017 19:05:03 +0200 Committer: Thomas Gleixner CommitDate: Thu, 20 Apr 2017 13:08:57 +0200 jump_label: Pull get_online_cpus() into generic code This change does two things: - it moves the get_online_cpus() call into generic code, with the aim of later providing some static_key ops that avoid it. - as a side effect it inverts the lock order between cpu_hotplug_lock and jump_label_mutex. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Cc: Sebastian Siewior Cc: Steven Rostedt Cc: jba...@akamai.com Link: http://lkml.kernel.org/r/20170418103422.590118...@infradead.org --- arch/mips/kernel/jump_label.c | 2 -- arch/sparc/kernel/jump_label.c | 2 -- arch/tile/kernel/jump_label.c | 2 -- arch/x86/kernel/jump_label.c | 2 -- kernel/jump_label.c| 14 ++ 5 files changed, 14 insertions(+), 8 deletions(-) diff --git a/arch/mips/kernel/jump_label.c b/arch/mips/kernel/jump_label.c index 3e586da..32e3168 100644 --- a/arch/mips/kernel/jump_label.c +++ b/arch/mips/kernel/jump_label.c @@ -58,7 +58,6 @@ void arch_jump_label_transform(struct jump_entry *e, insn.word = 0; /* nop */ } - get_online_cpus(); mutex_lock(&text_mutex); if (IS_ENABLED(CONFIG_CPU_MICROMIPS)) { insn_p->halfword[0] = insn.word >> 16; @@ -70,7 +69,6 @@ void arch_jump_label_transform(struct jump_entry *e, (unsigned long)insn_p + sizeof(*insn_p)); mutex_unlock(&text_mutex); - put_online_cpus(); } #endif /* HAVE_JUMP_LABEL */ diff --git a/arch/sparc/kernel/jump_label.c b/arch/sparc/kernel/jump_label.c index 07933b9..93adde1 100644 --- a/arch/sparc/kernel/jump_label.c +++ b/arch/sparc/kernel/jump_label.c @@ -41,12 +41,10 @@ void arch_jump_label_transform(struct jump_entry *entry, val = 0x0100; } - get_online_cpus(); mutex_lock(&text_mutex); *insn = val; flushi(insn); mutex_unlock(&text_mutex); - put_online_cpus(); } #endif diff --git a/arch/tile/kernel/jump_label.c b/arch/tile/kernel/jump_label.c index 07802d5..93931a4 100644 --- a/arch/tile/kernel/jump_label.c +++ b/arch/tile/kernel/jump_label.c @@ -45,14 +45,12 @@ static void __jump_label_transform(struct jump_entry *e, void arch_jump_label_transform(struct jump_entry *e, enum jump_label_type type) { - get_online_cpus(); mutex_lock(&text_mutex); __jump_label_transform(e, type); flush_icache_range(e->code, e->code + sizeof(tilegx_bundle_bits)); mutex_unlock(&text_mutex); - put_online_cpus(); } __init_or_module void arch_jump_label_transform_static(struct jump_entry *e, diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c index c37bd0f..ab4f491 100644 --- a/arch/x86/kernel/jump_label.c +++ b/arch/x86/kernel/jump_label.c @@ -105,11 +105,9 @@ static void __jump_label_transform(struct jump_entry *entry, void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type) { - get_online_cpus(); mutex_lock(&text_mutex); __jump_label_transform(entry, type, NULL, 0); mutex_unlock(&text_mutex); - put_online_cpus(); } static enum { diff --git a/kernel/jump_label.c b/kernel/jump_label.c index 6c9cb20..f3afe07 100644 --- a/kernel/jump_label.c +++ b/kernel/jump_label.c @@ -15,6 +15,7 @@ #include #include #include +#include #ifdef HAVE_JUMP_LABEL @@ -124,6 +125,12 @@ void static_key_slow_inc(struct static_key *key) return; } + /* +* A number of architectures need to synchronize I$ across +* the all CPUs, for that to be serialized against CPU hot-plug +* we need to avoid CPUs 
coming online. +*/ + get_online_cpus(); jump_label_lock(); if (atomic_read(&key->enabled) == 0) { atomic_set(&key->enabled, -1); @@ -133,6 +140,7 @@ void static_key_slow_inc(struct static_key *key) atomic_inc(&key->enabled); } jump_label_unlock(); + put_online_cpus(); } EXPORT_SYMBOL_GPL(static_key_slow_inc); @@ -146,6 +154,7 @@ static void __static_key_slow_dec(struct static_key *key, * returns is unbalanced, because all other static_key_slow_inc() * instances block while the update is in progress. */ + get_online_cpus(); if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) { WARN(atomic_read(&key->enabled) < 0, "jump label: negative count!\n"); @@ -159,6 +168,7
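The diff is cut off above, but the effect on callers is simple: the hotplug lock is now taken inside the generic slow paths rather than in each arch_jump_label_transform(), so ordinary users keep working unchanged while anything that already holds cpu_hotplug_lock must not call them (which is what the following patches address). A hedged usage sketch; my_key and do_slow_extra_work() are invented names:

/* Hedged usage sketch only. */
static struct static_key my_key = STATIC_KEY_INIT_FALSE;

static void my_feature_enable(void)
{
        /*
         * After this patch the hotplug lock is taken in here, in generic
         * code; the caller therefore must not already hold cpu_hotplug_lock.
         */
        static_key_slow_inc(&my_key);
}

static void my_fast_path(void)
{
        if (static_key_false(&my_key))  /* patched nop/jump site */
                do_slow_extra_work();   /* invented placeholder */
}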
[tip:smp/hotplug] jump_label: Provide static_key_slow_inc_cpuslocked()
Commit-ID: f5efc6fad63f5533a6083e95286920d5753e52bf Gitweb: http://git.kernel.org/tip/f5efc6fad63f5533a6083e95286920d5753e52bf Author: Peter Zijlstra (Intel) AuthorDate: Tue, 18 Apr 2017 19:05:04 +0200 Committer: Thomas Gleixner CommitDate: Thu, 20 Apr 2017 13:08:57 +0200 jump_label: Provide static_key_slow_inc_cpuslocked() Provide static_key_slow_inc_cpuslocked(), a variant that doesn't take cpu_hotplug_lock(). Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Cc: Sebastian Siewior Cc: Steven Rostedt Cc: jba...@akamai.com Link: http://lkml.kernel.org/r/20170418103422.636958...@infradead.org --- include/linux/jump_label.h | 3 +++ kernel/jump_label.c| 21 + 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index 2afd74b..7d07f0b 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -158,6 +158,7 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type); extern int jump_label_text_reserved(void *start, void *end); extern void static_key_slow_inc(struct static_key *key); +extern void static_key_slow_inc_cpuslocked(struct static_key *key); extern void static_key_slow_dec(struct static_key *key); extern void jump_label_apply_nops(struct module *mod); extern int static_key_count(struct static_key *key); @@ -213,6 +214,8 @@ static inline void static_key_slow_inc(struct static_key *key) atomic_inc(&key->enabled); } +#define static_key_slow_inc_cpuslocked static_key_slow_inc + static inline void static_key_slow_dec(struct static_key *key) { STATIC_KEY_CHECK_USE(); diff --git a/kernel/jump_label.c b/kernel/jump_label.c index f3afe07..308b12e 100644 --- a/kernel/jump_label.c +++ b/kernel/jump_label.c @@ -101,7 +101,7 @@ void static_key_disable(struct static_key *key) } EXPORT_SYMBOL_GPL(static_key_disable); -void static_key_slow_inc(struct static_key *key) +void __static_key_slow_inc(struct static_key *key) { int v, v1; @@ -130,7 +130,6 @@ void static_key_slow_inc(struct static_key *key) * the all CPUs, for that to be serialized against CPU hot-plug * we need to avoid CPUs coming online. */ - get_online_cpus(); jump_label_lock(); if (atomic_read(&key->enabled) == 0) { atomic_set(&key->enabled, -1); @@ -140,10 +139,22 @@ void static_key_slow_inc(struct static_key *key) atomic_inc(&key->enabled); } jump_label_unlock(); +} + +void static_key_slow_inc(struct static_key *key) +{ + get_online_cpus(); + __static_key_slow_inc(key); put_online_cpus(); } EXPORT_SYMBOL_GPL(static_key_slow_inc); +void static_key_slow_inc_cpuslocked(struct static_key *key) +{ + __static_key_slow_inc(key); +} +EXPORT_SYMBOL_GPL(static_key_slow_inc_cpuslocked); + static void __static_key_slow_dec(struct static_key *key, unsigned long rate_limit, struct delayed_work *work) { @@ -154,7 +165,6 @@ static void __static_key_slow_dec(struct static_key *key, * returns is unbalanced, because all other static_key_slow_inc() * instances block while the update is in progress. 
*/ - get_online_cpus(); if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) { WARN(atomic_read(&key->enabled) < 0, "jump label: negative count!\n"); @@ -168,20 +178,23 @@ static void __static_key_slow_dec(struct static_key *key, jump_label_update(key); } jump_label_unlock(); - put_online_cpus(); } static void jump_label_update_timeout(struct work_struct *work) { struct static_key_deferred *key = container_of(work, struct static_key_deferred, work.work); + get_online_cpus(); __static_key_slow_dec(&key->key, 0, NULL); + put_online_cpus(); } void static_key_slow_dec(struct static_key *key) { STATIC_KEY_CHECK_USE(); + get_online_cpus(); __static_key_slow_dec(key, 0, NULL); + put_online_cpus(); } EXPORT_SYMBOL_GPL(static_key_slow_dec);
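A typical user of the new variant is code invoked by the hotplug machinery itself, since such code already runs under cpu_hotplug_lock. A hedged sketch, with my_key and the callback names invented:

/* Hedged sketch; not from the patch. */
static struct static_key my_key = STATIC_KEY_INIT_FALSE;

static int my_cpu_online(unsigned int cpu)
{
        /*
         * Invoked by the hotplug core with cpu_hotplug_lock already held,
         * so the _cpuslocked variant is used instead of
         * static_key_slow_inc(), which would try to take the lock again.
         */
        static_key_slow_inc_cpuslocked(&my_key);
        return 0;
}

static int __init my_init(void)
{
        int ret;

        ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "myfeature:online",
                                my_cpu_online, NULL);
        return ret < 0 ? ret : 0;       /* _DYN returns the state number on success */
}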
[tip:smp/hotplug] perf: Avoid cpu_hotplug_lock r-r recursion
Commit-ID: 641693094ee1568502280f95900f374b2226b51d Gitweb: http://git.kernel.org/tip/641693094ee1568502280f95900f374b2226b51d Author: Peter Zijlstra (Intel) AuthorDate: Tue, 18 Apr 2017 19:05:05 +0200 Committer: Thomas Gleixner CommitDate: Thu, 20 Apr 2017 13:08:57 +0200 perf: Avoid cpu_hotplug_lock r-r recursion There are two call-sites where using static_key results in recursing on the cpu_hotplug_lock. Use the hotplug locked version of static_key_slow_inc(). Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Cc: Sebastian Siewior Cc: Steven Rostedt Cc: jba...@akamai.com Link: http://lkml.kernel.org/r/20170418103422.687248...@infradead.org --- kernel/events/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 634dd95..8aa3063 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7653,7 +7653,7 @@ static int perf_swevent_init(struct perf_event *event) if (err) return err; - static_key_slow_inc(&perf_swevent_enabled[event_id]); + static_key_slow_inc_cpuslocked(&perf_swevent_enabled[event_id]); event->destroy = sw_perf_event_destroy; } @@ -9160,7 +9160,7 @@ static void account_event(struct perf_event *event) mutex_lock(&perf_sched_mutex); if (!atomic_read(&perf_sched_count)) { - static_branch_enable(&perf_sched_events); + static_key_slow_inc_cpuslocked(&perf_sched_events.key); /* * Guarantee that all CPUs observe they key change and * call the perf scheduling hooks before proceeding to
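The quoted hunk is cut off at the end, but the rule it applies is the one set up by the previous patch: in a region that already read-holds the hotplug lock, use the _cpuslocked variant so the lock is not read-acquired a second time, which deadlocks once a hotplug writer is queued in between the two acquisitions. A hedged sketch of that rule; my_key and my_account() are invented names:

/* Hedged sketch of the rule applied above. */
static struct static_key my_key = STATIC_KEY_INIT_FALSE;

static void my_account(void)
{
        get_online_cpus();
        /*
         * Under a read-held cpu_hotplug_lock: a plain static_key_slow_inc()
         * would read-acquire the lock again, and with a hotplug writer
         * queued in between, the inner reader and the writer wait on each
         * other. The _cpuslocked variant skips the inner acquisition.
         */
        static_key_slow_inc_cpuslocked(&my_key);
        put_online_cpus();
}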