Re: [PATCH] mm/mapping_dirty_helpers: Guard hugepage pud's usage

2021-04-09 Thread Intel



On 4/8/21 9:15 PM, Zack Rusin wrote:

Lets make sure we don't use pud hugepage helpers on architectures
which do not support it. This fixes the code on arm64.

nits:

Perhaps be a little more specific about what it fixes? I figure it's a
compilation failure?
Also, please use the imperative form: "Fix the code on arm64" rather than
"This fixes the code on arm64".


Other than that LGTM.

Reviewed-by: Thomas Hellström (Intel) 




Signed-off-by: Zack Rusin 
Cc: Andrew Morton 
Cc: Thomas Hellström (Intel) 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
  mm/mapping_dirty_helpers.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index b59054ef2e10..b890854ec761 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -165,10 +165,12 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
return 0;
}
  
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

/* Huge pud */
walk->action = ACTION_CONTINUE;
if (pud_trans_huge(pudval) || pud_devmap(pudval))
WARN_ON(pud_write(pudval) || pud_dirty(pudval));
+#endif
  
  	return 0;

  }


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Intel



On 3/24/21 10:58 AM, Daniel Vetter wrote:

On Tue, Mar 23, 2021 at 09:42:18PM +0100, Thomas Hellström (Intel) wrote:

On 3/23/21 8:52 PM, Williams, Dan J wrote:

On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote:

TTM sets up huge page-table-entries both to system- and device
memory,
and we don't want gup to assume there are always valid backing struct
pages for these. For PTEs this is handled by setting the pte_special
bit,
but for the huge PUDs and PMDs, we have neither pmd_special nor
pud_special. Normally, huge TTM entries are identified by looking at
vma_is_special_huge(), but fast gup can't do that, so as an
alternative
define _devmap entries for which there are no backing dev_pagemap as
special, update documentation and make huge TTM entries _devmap,
after
verifying that there is no backing dev_pagemap.

Please do not abuse p{m,u}d_devmap like this. I'm in the process of
removing get_devpagemap() from the gup-fast path [1]. Instead there
should be space for p{m,u}d_special in the page table entries (at least
for x86-64). So the fix is to remove that old assumption that huge
pages can never be special.

[1]:
http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com


Hmm, yes with that patch it will obviously not work as intended.

Given that, I think we'll need to disable the TTM huge pages for now until
we can sort out and agree on using a page table entry bit.

Yeah :-/

I think going full pud/pmd_mkspecial should then also mesh well with
Jason's request to wrap it all up into a vmf_insert_* helper, so at least
it would all look rather pretty in the end.


Yes, I agree. Seems like the special bit (SW1) is available also for huge 
page table entries on x86 AFAICT, although just not implemented. 
Otherwise the SW bits appear completely used up.


The PTE-size vmf_insert_pfn_xxx functions insert either devmap or special
entries. I think the only users of the huge insert functions apart from TTM
currently insert devmap, so we should probably be able to do the same, and
then DRM / TTM wouldn't need to care at all about special or not.
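
For reference, the PTE-level path (insert_pfn() in mm/memory.c) does roughly
the following; this is a paraphrased sketch, not the exact upstream code:

	static vm_fault_t insert_pfn_sketch(struct vm_area_struct *vma,
					    unsigned long addr, pfn_t pfn,
					    pgprot_t prot)
	{
		pte_t entry;

		/* devmap pfns are marked devmap, everything else special */
		if (pfn_t_devmap(pfn))
			entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
		else
			entry = pte_mkspecial(pfn_t_pte(pfn, prot));

		/* page table walk and set_pte_at() omitted */
		return VM_FAULT_NOPAGE;
	}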


/Thomas




-Daniel


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Intel



On 3/24/21 1:24 PM, Jason Gunthorpe wrote:

On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote:

On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote:

On 3/23/21 5:37 PM, Jason Gunthorpe wrote:

On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:


@@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
if ((pfn & (fault_page_size - 1)) != 0)
goto out_fallback;
+   /*
+* Huge entries must be special, that is marking them as devmap
+* with no backing device map range. If there is a backing
+* range, Don't insert a huge entry.
+* If this check turns out to be too much of a performance hit,
+* we can instead have drivers indicate whether they may have
+* backing device map ranges and if not, skip this lookup.
+*/

I think we can do this statically:
- if it's system memory we know there's no devmap for it, and we do the
 trick to block gup_fast

Yes, that should work.

- if it's iomem, we know gup_fast won't work anyway if we don't set PFN_DEV,
 so might as well not do that

I think gup_fast will unfortunately mistake a huge iomem page for an
ordinary page and try to access a non-existent struct page for it, unless we
do the devmap trick.

And the lookup would then be for the rare case where a driver would have
already registered a dev_pagemap for an iomem area which may also be mapped
through TTM (like the patch from Felix a couple of weeks ago). If a driver
can promise not to do that, then we can safely remove the lookup.

Isn't the devmap PTE flag arch optional? Does this fall back to not
using huge pages on arches that don't support it?

Good point. No, currently it's only conditioned on transparent hugepage support.
We need to condition it on devmap support as well.


Also, I feel like this code to install "pte_special" huge pages does
not belong in the drm subsystem..

I could add helpers in huge_memory.c:

vmf_insert_pfn_pmd_prot_special() and
vmf_insert_pfn_pud_prot_special()

The somewhat annoying thing is that we'd need an error code so we fall
back to pte fault handling. That's at least my understanding of how
pud/pmd fault handling works. Not sure how awkward that is going to be
with the overall fault handling flow.

But aside from that I think this makes tons of sense.

Why should the driver be so specific?

vmf_insert_pfn_range_XXX()

And it will figure out the optimal way to build the page tables.

Driver should provide the largest physically contiguous range it can


I figure that would probably work, but since the huge_fault() interface 
is already providing the size of the fault based on how the page table is 
currently populated, that would mean moving a lot of that logic 
into that helper...
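
For reference, the huge_fault() hook being referred to looks roughly like this
(sketch from memory of include/linux/mm.h; the core mm passes the entry size
it would like the driver to fill):

	enum page_entry_size {
		PE_SIZE_PTE = 0,
		PE_SIZE_PMD,
		PE_SIZE_PUD,
	};

	struct vm_operations_struct {
		/* ... */
		vm_fault_t (*huge_fault)(struct vm_fault *vmf,
					 enum page_entry_size pe_size);
		/* ... */
	};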


/Thomas




Jason


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Intel



On 3/24/21 1:41 PM, Jason Gunthorpe wrote:

On Wed, Mar 24, 2021 at 01:35:17PM +0100, Thomas Hellström (Intel) wrote:

On 3/24/21 1:24 PM, Jason Gunthorpe wrote:

On Wed, Mar 24, 2021 at 10:56:43AM +0100, Daniel Vetter wrote:

On Tue, Mar 23, 2021 at 06:06:53PM +0100, Thomas Hellström (Intel) wrote:

On 3/23/21 5:37 PM, Jason Gunthorpe wrote:

On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:


@@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
if ((pfn & (fault_page_size - 1)) != 0)
goto out_fallback;
+   /*
+* Huge entries must be special, that is marking them as devmap
+* with no backing device map range. If there is a backing
+* range, Don't insert a huge entry.
+* If this check turns out to be too much of a performance hit,
+* we can instead have drivers indicate whether they may have
+* backing device map ranges and if not, skip this lookup.
+*/

I think we can do this statically:
- if it's system memory we know there's no devmap for it, and we do the
  trick to block gup_fast

Yes, that should work.

- if it's iomem, we know gup_fast won't work anyway if we don't set PFN_DEV,
  so might as well not do that

I think gup_fast will unfortunately mistake a huge iomem page for an
ordinary page and try to access a non-existent struct page for it, unless we
do the devmap trick.

And the lookup would then be for the rare case where a driver would have
already registered a dev_pagemap for an iomem area which may also be mapped
through TTM (like the patch from Felix a couple of weeks ago). If a driver
can promise not to do that, then we can safely remove the lookup.

Isn't the devmap PTE flag arch optional? Does this fall back to not
using huge pages on arches that don't support it?

Good point. No, currently it's only conditioned on transparent hugepage support.
We need to condition it on devmap support as well.


Also, I feel like this code to install "pte_special" huge pages does
not belong in the drm subsystem..

I could add helpers in huge_memory.c:

vmf_insert_pfn_pmd_prot_special() and
vmf_insert_pfn_pud_prot_special()

The somewhat annoying thing is that we'd need an error code so we fall
back to pte fault handling. That's at least my understanding of how
pud/pmd fault handling works. Not sure how awkward that is going to be
with the overall fault handling flow.

But aside from that I think this makes tons of sense.

Why should the driver be so specific?

vmf_insert_pfn_range_XXX()

And it will figure out the optimal way to build the page tables.

Driver should provide the largest physically contiguous range it can

I figure that would probably work, but since the huge_fault() interface is
already providing the size of the fault based on how the pagetable is
currently populated I figure that would have to move a lot of that logic
into that helper...

But we don't really care about the size of the fault when we stuff the
pfns.

The device might use it when handling the fault, but once the fault is
handled the device knows what the contiguous pfn range is that it has
available to stuff into the page tables, it just tells the vmf_insert
what it was able to create, and it creates the necessary page table
structure.

The size of the hole in the page table is really only advisory, the
device may not want to make a 2M or 1G page entry and may prefer to
only create 4k.

In an ideal world the creation/destruction of page table levels would
be dynamic at this point, like THP.


Hmm, but I'm not sure what problem we're trying to solve by changing the 
interface in this way?


Currently if the core vm requests a huge pud, we give it one, and if we 
can't or don't want to (because of dirty-tracking, for example, which is 
always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the 
fault is retried at a lower level. Also, determining whether we have a 
contiguous range is not free, so we don't want to do that unnecessarily.
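
As a minimal illustration of that flow (hypothetical handler and condition
names, not the actual TTM code):

	static vm_fault_t my_huge_fault(struct vm_fault *vmf,
					enum page_entry_size pe_size)
	{
		/*
		 * Dirty tracking is done at 4K granularity, so refuse huge
		 * entries and let the core retry at the next smaller level.
		 */
		if (my_dirty_tracking_enabled(vmf))	/* hypothetical */
			return VM_FAULT_FALLBACK;

		switch (pe_size) {
		case PE_SIZE_PMD:
			return my_insert_huge_pmd(vmf);	/* hypothetical */
		case PE_SIZE_PUD:
			return my_insert_huge_pud(vmf);	/* hypothetical */
		default:
			return VM_FAULT_FALLBACK;
		}
	}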


/Thomas




Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Intel



On 3/24/21 2:48 PM, Jason Gunthorpe wrote:

On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote:


In an ideal world the creation/destruction of page table levels would
by dynamic at this point, like THP.

Hmm, but I'm not sure what problem we're trying to solve by changing the
interface in this way?

We are trying to make a sensible driver API to deal with huge pages.
  

Currently if the core vm requests a huge pud, we give it one, and if we
can't or don't want to (because of dirty-tracking, for example, which is
always done on 4K page-level) we just return VM_FAULT_FALLBACK, and the
fault is retried at a lower level.

Well, my thought would be to move the pte related stuff into
vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.

I don't know if the locking works out, but it feels cleaner that the
driver tells the vmf how big a page it can stuff in, not the vm
telling the driver to stuff in a certain size page which it might not
want to do.

Some devices want to work on a in-between page size like 64k so they
can't form 2M pages but they can stuff 64k of 4K pages in a batch on
every fault.


Hmm, yes, but in that case we would anyway be limited to inserting ranges 
smaller than or equal to the fault size, to avoid extensive and possibly 
unnecessary checks for contiguous memory. And then if we can't support 
the full fault size, we'd need to either presume a size and alignment of 
the next level or search for contiguous memory in both directions around 
the fault address, perhaps unnecessarily as well. I do think the current 
interface works ok, as we're just acting on what the core vm tells us to do.


/Thomas



That idea doesn't fit naturally if the VM is driving the size.

Jason


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Intel



On 3/24/21 7:31 PM, Christian König wrote:



Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) 
wrote:

On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) 
wrote:


In an ideal world the creation/destruction of page table levels 
would

by dynamic at this point, like THP.
Hmm, but I'm not sure what problem we're trying to solve by 
changing the

interface in this way?

We are trying to make a sensible driver API to deal with huge pages.
Currently if the core vm requests a huge pud, we give it one, and 
if we
can't or don't want to (because of dirty-tracking, for example, 
which is
always done on 4K page-level) we just return VM_FAULT_FALLBACK, 
and the

fault is retried at a lower level.

Well, my thought would be to move the pte related stuff into
vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.

I don't know if the locking works out, but it feels cleaner that the
driver tells the vmf how big a page it can stuff in, not the vm
telling the driver to stuff in a certain size page which it might not
want to do.

Some devices want to work on a in-between page size like 64k so they
can't form 2M pages but they can stuff 64k of 4K pages in a batch on
every fault.

Hmm, yes, but we would in that case be limited anyway to insert ranges
smaller than and equal to the fault size to avoid extensive and 
possibly

unnecessary checks for contigous memory.

Why? The insert function is walking the page tables, it just updates
things as they are. It learns the arrangement for free while doing the
walk.

The device has to always provide consistent data, if it overlaps into
pages that are already populated that is fine so long as it isn't
changing their addresses.


And then if we can't support the full fault size, we'd need to
either presume a size and alignment of the next level or search for
contigous memory in both directions around the fault address,
perhaps unnecessarily as well.

You don't really need to care about levels, the device should be
faulting in the largest memory regions it can within its efficiency.

If it works on 4M pages then it should be faulting 4M pages. The page
size of the underlying CPU doesn't really matter much other than some
tuning to impact how the device's allocator works.


Yes, but then we'd be adding a lot of complexity into this function that 
is already provided by the current interface for DAX, for little or no 
gain, at least in the drm/ttm setting. Please think of the following 
situation: You get a fault, you do an extensive time-consuming scan of 
your VRAM buffer object into which the fault goes and determine you can 
fault 1GB. Now you hand it to vmf_insert_range() and because the 
user-space address is misaligned, or already partly populated because of 
a previous eviction, you can only fault single pages, and you end up 
faulting a full GB of single pages perhaps for a one-time small update.


On top of this, unless we want to do the walk trying increasingly 
smaller sizes of vmf_insert_xxx(), we'd have to use 
apply_to_page_range() and teach it about transhuge page table entries, 
because pagewalk.c can't be used (It can't populate page tables). That 
also means apply_to_page_range() needs to be complicated with page table 
locks since transhuge pages aren't stable and can be zapped and 
refaulted under us while we do the walk.


On top of this, the user-space address allocator needs to know how large 
GPU pages are aligned in buffer objects to have a reasonable chance of 
aligning with CPU huge page boundaries, which is a requirement to be able 
to insert a huge CPU page table entry, so the driver would basically 
need the drm helper that can do this alignment anyway.


All this makes me think we should settle for the current interface for 
now, and if someone feels like refining it, I'm fine with that.  After 
all, this isn't a strange drm/ttm invention, it's a pre-existing 
interface that we reuse.




I agree with Jason here.

We get the best efficiency when we look at what the GPU driver 
provides and make sure that we handle one GPU page at once, instead of 
looking too much into what the CPU is doing with its page tables.


At least on AMD GPUs the GPU page size can be anything between 4KiB 
and 2GiB, and if we fill in a 2GiB chunk at once this can in theory be 
handled by just two giant page table entries on the CPU side.


Yes, but I fail to see why, with the current code, we can't do this 
(save the refcounting bug)?


/Thomas



Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Intel



On 3/24/21 5:34 PM, Dave Hansen wrote:

On 3/24/21 3:05 AM, Thomas Hellström (Intel) wrote:

Yes, I agree. Seems like the special (SW1) is available also for huge
page table entries on x86 AFAICT, although just not implemented.
Otherwise the SW bits appear completely used up.

Although the _PAGE_BIT_SOFTW* bits are used up, there's plenty of room
in the hardware PTEs.  Bits 52->58 are software-available, and we're
only using 58 at the moment.

We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
used.  It's quite possible we can encode another use even in the
existing bits.

Personally, I'd just try:

#define _PAGE_BIT_SOFTW5	57	/* available for programmer */

OK, I'll follow your advice here. FWIW I grepped for SW1 and it seems 
to be used in a selftest, but only for PTEs AFAICT.
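
For illustration, using that bit for a huge "special" flag could look
something like the sketch below on x86-64; the names are hypothetical and
don't exist upstream:

	#define _PAGE_BIT_SOFTW5	57	/* available for programmer */
	#define _PAGE_BIT_SPECIAL_HUGE	_PAGE_BIT_SOFTW5
	#define _PAGE_SPECIAL_HUGE	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL_HUGE)

	static inline pmd_t pmd_mkspecial(pmd_t pmd)
	{
		return pmd_set_flags(pmd, _PAGE_SPECIAL_HUGE);
	}

	static inline int pmd_special(pmd_t pmd)
	{
		return !!(pmd_flags(pmd) & _PAGE_SPECIAL_HUGE);
	}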


Oh, and we don't care about 32-bit much anymore?

/Thomas




Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel



On 3/25/21 12:14 AM, Jason Gunthorpe wrote:

On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:

On 3/24/21 7:31 PM, Christian König wrote:


Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:

On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
wrote:

On 3/24/21 2:48 PM, Jason Gunthorpe wrote:

On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
(Intel) wrote:


In an ideal world the creation/destruction of page
table levels would
by dynamic at this point, like THP.

Hmm, but I'm not sure what problem we're trying to solve
by changing the
interface in this way?

We are trying to make a sensible driver API to deal with huge pages.

Currently if the core vm requests a huge pud, we give it
one, and if we
can't or don't want to (because of dirty-tracking, for
example, which is
always done on 4K page-level) we just return
VM_FAULT_FALLBACK, and the
fault is retried at a lower level.

Well, my thought would be to move the pte related stuff into
vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.

I don't know if the locking works out, but it feels cleaner that the
driver tells the vmf how big a page it can stuff in, not the vm
telling the driver to stuff in a certain size page which it might not
want to do.

Some devices want to work on a in-between page size like 64k so they
can't form 2M pages but they can stuff 64k of 4K pages in a batch on
every fault.

Hmm, yes, but we would in that case be limited anyway to insert ranges
smaller than and equal to the fault size to avoid extensive and
possibly
unnecessary checks for contigous memory.

Why? The insert function is walking the page tables, it just updates
things as they are. It learns the arragement for free while doing the
walk.

The device has to always provide consistent data, if it overlaps into
pages that are already populated that is fine so long as it isn't
changing their addresses.


And then if we can't support the full fault size, we'd need to
either presume a size and alignment of the next level or search for
contigous memory in both directions around the fault address,
perhaps unnecessarily as well.

You don't really need to care about levels, the device should be
faulting in the largest memory regions it can within its efficiency.

If it works on 4M pages then it should be faulting 4M pages. The page
size of the underlying CPU doesn't really matter much other than some
tuning to impact how the device's allocator works.

Yes, but then we'd be adding a lot of complexity into this function that is
already provided by the current interface for DAX, for little or no gain, at
least in the drm/ttm setting. Please think of the following situation: You
get a fault, you do an extensive time-consuming scan of your VRAM buffer
object into which the fault goes and determine you can fault 1GB. Now you
hand it to vmf_insert_range() and because the user-space address is
misaligned, or already partly populated because of a previous eviction, you
can only fault single pages, and you end up faulting a full GB of single
pages perhaps for a one-time small update.

Why would "you can only fault single pages" ever be true? If you have
1GB of pages then the vmf_insert_range should allocate enough page
table entries to consume it, regardless of alignment.


Ah yes, what I meant was that you can only insert PTE size entries, either 
because of misalignment or because the page table is already 
pre-populated with pmd size page directories, which you can't remove 
with only the read side of the mmap lock held.




And why shouldn't DAX switch to this kind of interface anyhow? It is
basically exactly the same problem. The underlying filesystem block
size is *not* necessarily aligned to the CPU page table sizes and DAX
would benefit from better handling of this mismatch.


First, I think we must sort out what "better handling" means. This is my 
takeaway from the discussion so far:


Claimed Pros: of vmf_insert_range()
* We get an interface that doesn't require knowledge of CPU page table 
entry level sizes.
* We get the best efficiency when we look at what the GPU driver 
provides. (I disagree on this one).


Claimed Cons:
* A new implementation that may get complicated particularly if it 
involves modifying all of the DAX code
* The driver would have to know about those sizes anyway to get 
alignment right (Applies to DRM, because we mmap buffer objects, not 
physical address ranges. But not to DAX AFAICT),
* We lose efficiency, because we are prepared to spend extra effort 
on alignment and contiguity checks when we know we can insert a huge 
page table entry, but not if we know we can't
* We lose efficiency because we might unnecessarily prefault a number 
of PTE size page-table entries (really a special case of the above one).


Now in the context of quickly fixing a critical bug, the choice IMHO 
becomes easy.

Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel



On 3/25/21 9:27 AM, Christian König wrote:

Am 25.03.21 um 08:48 schrieb Thomas Hellström (Intel):


On 3/25/21 12:14 AM, Jason Gunthorpe wrote:
On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) 
wrote:

On 3/24/21 7:31 PM, Christian König wrote:


Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:

On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
wrote:

On 3/24/21 2:48 PM, Jason Gunthorpe wrote:

On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
(Intel) wrote:


In an ideal world the creation/destruction of page
table levels would
by dynamic at this point, like THP.

Hmm, but I'm not sure what problem we're trying to solve
by changing the
interface in this way?
We are trying to make a sensible driver API to deal with huge 
pages.

Currently if the core vm requests a huge pud, we give it
one, and if we
can't or don't want to (because of dirty-tracking, for
example, which is
always done on 4K page-level) we just return
VM_FAULT_FALLBACK, and the
fault is retried at a lower level.

Well, my thought would be to move the pte related stuff into
vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.

I don't know if the locking works out, but it feels cleaner 
that the

driver tells the vmf how big a page it can stuff in, not the vm
telling the driver to stuff in a certain size page which it 
might not

want to do.

Some devices want to work on a in-between page size like 64k so 
they
can't form 2M pages but they can stuff 64k of 4K pages in a 
batch on

every fault.
Hmm, yes, but we would in that case be limited anyway to insert 
ranges

smaller than and equal to the fault size to avoid extensive and
possibly
unnecessary checks for contigous memory.

Why? The insert function is walking the page tables, it just updates
things as they are. It learns the arragement for free while doing 
the

walk.

The device has to always provide consistent data, if it overlaps 
into

pages that are already populated that is fine so long as it isn't
changing their addresses.


And then if we can't support the full fault size, we'd need to
either presume a size and alignment of the next level or search for
contigous memory in both directions around the fault address,
perhaps unnecessarily as well.

You don't really need to care about levels, the device should be
faulting in the largest memory regions it can within its efficiency.

If it works on 4M pages then it should be faulting 4M pages. The 
page
size of the underlying CPU doesn't really matter much other than 
some

tuning to impact how the device's allocator works.
Yes, but then we'd be adding a lot of complexity into this function 
that is
already provided by the current interface for DAX, for little or no 
gain, at
least in the drm/ttm setting. Please think of the following 
situation: You
get a fault, you do an extensive time-consuming scan of your VRAM 
buffer
object into which the fault goes and determine you can fault 1GB. 
Now you

hand it to vmf_insert_range() and because the user-space address is
misaligned, or already partly populated because of a previous 
eviction, you
can only fault single pages, and you end up faulting a full GB of 
single

pages perhaps for a one-time small update.

Why would "you can only fault single pages" ever be true? If you have
1GB of pages then the vmf_insert_range should allocate enough page
table entries to consume it, regardless of alignment.


Ah yes, What I meant was you can only insert PTE size entries, either 
because of misalignment or because the page-table is alredy 
pre-populated with pmd size page directories, which you can't remove 
with only the read side of the mmap lock held.


Please explain that further. Why do we need the mmap lock to insert 
PMDs but not when insert PTEs?


We don't. But once you've inserted a PMD directory you can't remove it 
unless you have the mmap lock (and probably also the i_mmap_lock in 
write mode). That for example means that if you have a VRAM region 
mapped with huge PMDs, and then it gets evicted, and you happen to read 
a byte from it when it's evicted and therefore populate the full region 
with PTEs pointing to system pages, you can't go back to huge PMDs again 
without a munmap() in between.





And why shouldn't DAX switch to this kind of interface anyhow? It is
basically exactly the same problem. The underlying filesystem block
size is *not* necessarily aligned to the CPU page table sizes and DAX
would benefit from better handling of this mismatch.


First, I think we must sort out what "better handling" means. This is 
my takeout of the discussion so far:


Claimed Pros: of vmf_insert_range()
* We get an interface that doesn't require knowledge of CPU page 
table entry level sizes.
* We get the best efficiency when we look at what the GPU driver 
provides. (I disagree on this one).


Claimed Cons:

Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel



On 3/25/21 12:30 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 10:51:35AM +0100, Thomas Hellström (Intel) wrote:


Please explain that further. Why do we need the mmap lock to insert PMDs
but not when insert PTEs?

We don't. But once you've inserted a PMD directory you can't remove it
unless you have the mmap lock (and probably also the i_mmap_lock in write
mode). That for example means that if you have a VRAM region mapped with
huge PMDs, and then it gets evicted, and you happen to read a byte from it
when it's evicted and therefore populate the full region with PTEs pointing
to system pages, you can't go back to huge PMDs again without a munmap() in
between.

This is all basically magic to me still, but THP does this
transformation and I think what it does could work here too. We
probably wouldn't be able to upgrade while handling fault, but at the
same time, this should be quite rare as it would require the driver to
have supplied a small page for this VMA at some point.


IIRC THP handles this using khugepaged, grabbing the lock in write mode 
when coalescing, and yeah, I don't think anything prevents anyone from 
extending khugepaged to do that also for special huge page table entries.





Apart from that I still don't fully get why we need this in the first
place.

Because virtual huge page address boundaries need to be aligned with
physical huge page address boundaries, and mmap can happen before bos are
populated so you have no way of knowing how physical huge page
address

But this is a mmap-time problem, fault can't fix mmap using the wrong VA.


Nope. The point here was that in this case, to make sure mmap uses the 
correct VA to give us a reasonable chance of alignment, the driver 
might need to be aware of and do trickery with the huge page-table-entry 
sizes anyway, although I think in most cases a standard helper for this 
can be supplied.


/Thomas





I really don't see that either. When a buffer is accessed by the CPU it
is in > 90% of all cases completely accessed. Not faulting in full
ranges is just optimizing for a really unlikely case here.

It might be that you're right, but are all drivers wanting to use this like
drm in this respect? Using the interface to fault in a 1G range in the hope
it could map it to a huge pud may unexpectedly consume and populate some 16+
MB of page tables.

If the underlying device block size is so big then sure, why not? The
"unexpectedly" should be quite rare/non-existent anyhow.

Jason
  


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel



On 3/25/21 1:09 PM, Christian König wrote:

Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) 
wrote:



Nope. The point here was that in this case, to make sure mmap uses the
correct VA to give us a reasonable chance of alignement, the driver 
might
need to be aware of and do trickery with the huge page-table-entry 
sizes
anyway, although I think in most cases a standard helper for this 
can be

supplied.

Of course the driver needs some way to influence the VA mmap uses,
generally it should align to the natural page size of the device


Well a mmap() needs to be aligned to the page size of the CPU, but not 
necessarily to the one of the device.


So I'm pretty sure the device driver should not be involved in any way 
the choosing of the VA for the CPU mapping.


Christian.

We've had this discussion before and at that time I managed to convince 
you by pointing to the shmem helper for this, shmem_get_unmapped_area().


Basically there are two ways to do this. Either use a standard helper 
similar to shmem's, and then the driver needs to align physical (device) 
huge page boundaries to address space offset huge page boundaries. If 
you don't do that, you can just as well use a custom function that 
compensates for not doing that (drm_get_unmapped_area()). Both require 
driver knowledge of the size of huge pages.


Without a function to adjust, mmap will use its default (16 byte?) 
alignment and the chance of huge page alignment becomes very small.
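
A rough sketch of the kind of adjustment such a helper has to make
(hypothetical and simplified; the real shmem / drm helpers handle more
corner cases):

	/*
	 * Pick a VA whose offset within a huge page matches the phase of the
	 * file offset, so that huge CPU page table entries can line up with
	 * huge page aligned chunks of the backing object.
	 */
	static unsigned long my_get_unmapped_area(struct file *file,
						  unsigned long uaddr,
						  unsigned long len,
						  unsigned long pgoff,
						  unsigned long flags)
	{
		unsigned long addr;

		/* Ask for extra room so we can shift the start forward. */
		addr = current->mm->get_unmapped_area(file, uaddr,
						      len + HPAGE_PMD_SIZE,
						      pgoff, flags);
		if (IS_ERR_VALUE(addr))
			return addr;

		/* Shift < HPAGE_PMD_SIZE, covered by the extra room above. */
		addr += ((pgoff << PAGE_SHIFT) - addr) & (HPAGE_PMD_SIZE - 1);
		return addr;
	}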


/Thomas




Jason


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel

Hi,

On 3/25/21 2:02 PM, Christian König wrote:



Am 25.03.21 um 13:36 schrieb Thomas Hellström (Intel):


On 3/25/21 1:09 PM, Christian König wrote:

Am 25.03.21 um 13:01 schrieb Jason Gunthorpe:
On Thu, Mar 25, 2021 at 12:53:15PM +0100, Thomas Hellström (Intel) 
wrote:


Nope. The point here was that in this case, to make sure mmap uses 
the
correct VA to give us a reasonable chance of alignement, the 
driver might
need to be aware of and do trickery with the huge page-table-entry 
sizes
anyway, although I think in most cases a standard helper for this 
can be

supplied.

Of course the driver needs some way to influence the VA mmap uses,
gernally it should align to the natural page size of the device


Well a mmap() needs to be aligned to the page size of the CPU, but 
not necessarily to the one of the device.


So I'm pretty sure the device driver should not be involved in any 
way the choosing of the VA for the CPU mapping.


Christian.

We've had this discussion before and at that time I managed to 
convince you by pointing to the shmem helper for this, 
shmem_get_unmapped_area().


No, you didn't convince me. I was just surprised that this is 
something under driver control.




Basically there are two ways to do this. Either use a standard helper 
similar to shmem's, and then the driver needs to align physical 
(device) huge page boundaries to address space offset huge page 
boundaries. If you don't do that you can just as well use a custom 
function that adjusts for you not doing that 
(drm_get_unmapped_area()). Both require driver knowledge of the size 
of huge pages.


And once more, at least for GPU drivers that looks like the totally 
wrong approach to me.


Aligning the VMA so that huge page allocations become possible is the 
job of the MM subsystem and not that of the drivers.



Previous discussion here

https://www.spinics.net/lists/linux-mm/msg205291.html



Without a function to adjust, mmap will use it's default (16 byte?) 
alignment and chance of alignment becomes very small.


Well it's 4KiB at least.

Yes :/ ...


Regards,
Christian.


Thanks,

Thomas




Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel



On 3/24/21 9:25 PM, Dave Hansen wrote:

On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:

We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
used.  It's quite possible we can encode another use even in the
existing bits.

Personally, I'd just try:

#define _PAGE_BIT_SOFTW5    57  /* available for programmer */


OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
used in a selftest, but only for PTEs AFAICT.

Oh, and we don't care about 32-bit much anymore?

On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
enabled.  IOW, we can handle the majority of 32-bit CPUs out there.

But, yeah, we don't care about 32-bit. :)


Hmm,

Actually it makes some sense to use SW1, to make it end up in the same 
dword as the PSE bit, since from what I can tell, reading a 64-bit pmd_t 
on 32-bit PAE is not atomic, so in theory a huge pmd could be modified 
while reading the pmd_t, making the dwords inconsistent. How does that 
work with fast gup anyway?


In any case, what would be the best cause of action here? Use SW1 or 
disable completely for 32-bit?


/Thomas





Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel



On 3/25/21 6:55 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:

On 3/24/21 9:25 PM, Dave Hansen wrote:

On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:

We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
used.  It's quite possible we can encode another use even in the
existing bits.

Personally, I'd just try:

#define _PAGE_BIT_SOFTW5    57  /* available for programmer */


OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
used in a selftest, but only for PTEs AFAICT.

Oh, and we don't care about 32-bit much anymore?

On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
enabled.  IOW, we can handle the majority of 32-bit CPUs out there.

But, yeah, we don't care about 32-bit. :)

Hmm,

Actually it makes some sense to use SW1, to make it end up in the same dword
as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
PAE is not atomic, so in theory a huge pmd could be modified while reading
the pmd_t making the dwords inconsistent How does that work with fast
gup anyway?

It loops to get an atomic 64 bit value if the arch can't provide an
atomic 64 bit load


Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting 
pmd is dereferenced either in try_grab_compound_head() or 
__gup_device_huge(), before the pmd is compared to the value the pointer 
is currently pointing to. Couldn't those dereferences be on invalid 
pointers?


/Thomas



Jason


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-25 Thread Intel



On 3/25/21 7:24 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:

On 3/25/21 6:55 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:

On 3/24/21 9:25 PM, Dave Hansen wrote:

On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:

We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
used.  It's quite possible we can encode another use even in the
existing bits.

Personally, I'd just try:

#define _PAGE_BIT_SOFTW5    57  /* available for programmer */


OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
used in a selftest, but only for PTEs AFAICT.

Oh, and we don't care about 32-bit much anymore?

On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
enabled.  IOW, we can handle the majority of 32-bit CPUs out there.

But, yeah, we don't care about 32-bit. :)

Hmm,

Actually it makes some sense to use SW1, to make it end up in the same dword
as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
PAE is not atomic, so in theory a huge pmd could be modified while reading
the pmd_t making the dwords inconsistent How does that work with fast
gup anyway?

It loops to get an atomic 64 bit value if the arch can't provide an
atomic 64 bit load

Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
before the pmd is compared to the value the pointer is currently pointing
to. Couldn't those dereferences be on invalid pointers?

Uh.. That does look questionable, yes. Unless there is some tricky
reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
has a stable upper 32 bits..

The pte does it with ptep_get_lockless(), we probably need the same
for the other levels too instead of open coding a READ_ONCE?

Jason


Yes, unless that comment before local_irq_disable() means some magic is 
done to prevent bad things happening, but I guess if it's needed for 
ptes, it's probably needed for pmds and puds as well.
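
For reference, the lockless read Jason refers to looks roughly like this for
the two-word case (a from-memory sketch of ptep_get_lockless() under
CONFIG_GUP_GET_PTE_LOW_HIGH; a pmd/pud variant would presumably follow the
same pattern):

	static inline pte_t ptep_get_lockless_sketch(pte_t *ptep)
	{
		pte_t pte;

		/* Re-read the two halves until they are consistent. */
		do {
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}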


/Thomas




Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-23 Thread Intel



On 3/23/21 2:52 PM, Jason Gunthorpe wrote:

On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote:

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..1b6a127f0bdd 100644
+++ b/mm/gup.c
@@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int 
nr_start,
  }
  
  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL

+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be
+ * special. For THP, special huge entries are indicated by xxx_devmap()
+ * returning true, but a corresponding call to get_dev_pagemap() will
+ * return NULL.
+ *
+ * For a futex to be placed on a THP tail page, get_futex_key requires a
+ * get_user_pages_fast_only implementation that can pin pages. Thus it's still
+ * useful to have gup_huge_pmd even if we can't operate on ptes.
+ */

Why move this comment? I think it was correct where it was


Yes, you're right. I misread it to refer to the actual code in the 
gup_pte_range function rather than to the empty version. I'll move it back.


/Thomas




Jason


Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas

2021-03-23 Thread Intel



On 3/23/21 3:00 PM, Jason Gunthorpe wrote:

On Sun, Mar 21, 2021 at 07:45:29PM +0100, Thomas Hellström (Intel) wrote:

To block fast gup we need to make sure TTM ptes are always special.
With MIXEDMAP we, on architectures that don't support pte_special,
insert normal ptes, but OTOH on those architectures, fast is not
supported.
At the same time, the function documentation to vm_normal_page() suggests
that ptes pointing to system memory pages of MIXEDMAP vmas are always
normal, but that doesn't seem consistent with what's implemented in
vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
needed.

But to make sure and to avoid also normal (non-fast) gup, make all
TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
anymore so make is_cow_mapping() available and use it to reject
COW mappings at mmap time.

There was previously a comment in the code that WC mappings together
with x86 PAT + PFNMAP was bad for performance. However from looking at
vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
are handled the same for architectures that support pte_special. This
means there should not be a performance difference anymore, but this
needs to be verified.

Cc: Christian Koenig 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: dri-de...@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) 
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 --
  include/linux/mm.h  |  5 +
  mm/internal.h   |  5 -
  3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 1c34983480e5..708c6fb9be81 100644
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 * at arbitrary times while the data is mmap'ed.
 * See vmf_insert_mixed_prot() for a discussion.
 */
-   if (vma->vm_flags & VM_MIXEDMAP)
-   ret = vmf_insert_mixed_prot(vma, address,
-   __pfn_to_pfn_t(pfn, 
PFN_DEV),
-   prot);
-   else
-   ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
+   ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
  
  		/* Never error on prefaulted PTEs */

if (unlikely((ret & VM_FAULT_ERROR))) {
@@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct 
ttm_buffer_object *bo, struct vm_area_s
 * Note: We're transferring the bo reference to
 * vma->vm_private_data here.
 */
-
vma->vm_private_data = bo;
  
  	/*

-* We'd like to use VM_PFNMAP on shared mappings, where
-* (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
-* but for some reason VM_PFNMAP + x86 PAT + write-combine is very
-* bad for performance. Until that has been sorted out, use
-* VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
+* PFNMAP forces us to block COW mappings in mmap(),
+* and with MIXEDMAP we would incorrectly allow fast gup
+* on TTM memory on architectures that don't have pte_special.
 */
-   vma->vm_flags |= VM_MIXEDMAP;
-   vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+   vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
  }
  
  int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,

@@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct 
*vma,
if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
return -EINVAL;
  
+	if (unlikely(is_cow_mapping(vma->vm_flags)))

+   return -EINVAL;
+
bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
if (unlikely(!bo))
return -EINVAL;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..c6ebf7f9ddbb 100644
+++ b/include/linux/mm.h
@@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct 
*vma)
return vma->vm_flags & VM_ACCESS_FLAGS;
  }
  
+static inline bool is_cow_mapping(vm_flags_t flags)

+{
+   return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+}

Most driver places are just banning VM_SHARED.

I see you copied this from remap_pfn_range(), but that logic is so
special I'm not sure..


It's actually used all over the place, both in drivers and also redefined 
under CONFIG_MEM_SOFT_DIRTY, which makes me think Daniel's idea of 
vma_is_cow_mapping() is better since it won't clash and cause 
compilation failures...
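
I.e. something like this (hypothetical wrapper, name as per Daniel's
suggestion):

	static inline bool vma_is_cow_mapping(const struct vm_area_struct *vma)
	{
		return is_cow_mapping(vma->vm_flags);
	}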




Can the user mprotect the write back on with the above logic?

No, it's blocked by mprotect.

Do we
n

Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas

2021-03-23 Thread Intel



On 3/23/21 3:04 PM, Jason Gunthorpe wrote:

On Tue, Mar 23, 2021 at 12:47:24PM +0100, Daniel Vetter wrote:


+static inline bool is_cow_mapping(vm_flags_t flags)

Bit a bikeshed, but I wonder whether the public interface shouldn't be
vma_is_cow_mapping. Or whether this shouldn't be rejected somewhere else,
since at least in drivers/gpu we have tons of cases that don't check for
this and get it all kinds of wrong I think.

remap_pfn_range handles this for many cases, but by far not for all.

Anyway patch itself lgtm:

Reviewed-by: Daniel Vetter 

I would like it if io_remap_pfn_range() did not allow shared mappings
at all.


You mean private mappings?



IIRC it doesn't work anyway, the kernel can't reliably copy from IO
pages eg the "_copy_from_user_inatomic()" under cow_user_page() will
not work on s390 that requires all IO memory be accessed with special
instructions.

Unfortunately I have no idea what the long ago special case of
allowing COW'd IO mappings is. :\


Me neither, but at some point it must have been important enough to 
introduce VM_MIXEDMAP...


/Thomas



Jason


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-23 Thread Intel

Hi,

On 3/23/21 12:34 PM, Daniel Vetter wrote:

On Sun, Mar 21, 2021 at 07:45:28PM +0100, Thomas Hellström (Intel) wrote:

TTM sets up huge page-table-entries both to system- and device memory,
and we don't want gup to assume there are always valid backing struct
pages for these. For PTEs this is handled by setting the pte_special bit,
but for the huge PUDs and PMDs, we have neither pmd_special nor
pud_special. Normally, huge TTM entries are identified by looking at
vma_is_special_huge(), but fast gup can't do that, so as an alternative
define _devmap entries for which there are no backing dev_pagemap as
special, update documentation and make huge TTM entries _devmap, after
verifying that there is no backing dev_pagemap.

One other alternative would be to block TTM huge page-table-entries
completely, and while currently only vmwgfx uses them, they would be
beneficial to other graphics drivers moving forward as well.

Cc: Christian Koenig 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: dri-de...@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) 
---
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 17 -
  mm/gup.c| 21 +++--
  mm/memremap.c   |  5 +
  3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 6dc96cf66744..1c34983480e5 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
pfn_t pfnt;
struct ttm_tt *ttm = bo->ttm;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+   struct dev_pagemap *pagemap;
  
  	/* Fault should not cross bo boundary. */

page_offset &= ~(fault_page_size - 1);
@@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
if ((pfn & (fault_page_size - 1)) != 0)
goto out_fallback;
  
+	/*

+* Huge entries must be special, that is marking them as devmap
+* with no backing device map range. If there is a backing
+* range, Don't insert a huge entry.
+* If this check turns out to be too much of a performance hit,
+* we can instead have drivers indicate whether they may have
+* backing device map ranges and if not, skip this lookup.
+*/

I think we can do this statically:
- if it's system memory we know there's no devmap for it, and we do the
   trick to block gup_fast

Yes, that should work.

- if it's iomem, we know gup_fast won't work anyway if we don't set PFN_DEV,
   so might as well not do that


I think gup_fast will unfortunately mistake a huge iomem page for an 
ordinary page and try to access a non-existent struct page for it, 
unless we do the devmap trick.


And the lookup would then be for the rare case where a driver would have 
already registered a dev_pagemap for an iomem area which may also be 
mapped through TTM (like the patch from Felix a couple of weeks ago). If 
a driver can promise not to do that, then we can safely remove the lookup.


/Thomas




Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-23 Thread Intel



On 3/23/21 5:37 PM, Jason Gunthorpe wrote:

On Tue, Mar 23, 2021 at 05:34:51PM +0100, Thomas Hellström (Intel) wrote:


@@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
if ((pfn & (fault_page_size - 1)) != 0)
goto out_fallback;
+   /*
+* Huge entries must be special, that is marking them as devmap
+* with no backing device map range. If there is a backing
+* range, Don't insert a huge entry.
+* If this check turns out to be too much of a performance hit,
+* we can instead have drivers indicate whether they may have
+* backing device map ranges and if not, skip this lookup.
+*/

I think we can do this statically:
- if it's system memory we know there's no devmap for it, and we do the
trick to block gup_fast

Yes, that should work.

- if it's iomem, we know gup_fast won't work anyway if we don't set PFN_DEV,
so might as well not do that

I think gup_fast will unfortunately mistake a huge iomem page for an
ordinary page and try to access a non-existent struct page for it, unless we
do the devmap trick.

And the lookup would then be for the rare case where a driver would have
already registered a dev_pagemap for an iomem area which may also be mapped
through TTM (like the patch from Felix a couple of weeks ago). If a driver
can promise not to do that, then we can safely remove the lookup.

Isn't the devmap PTE flag arch optional? Does this fall back to not
using huge pages on arches that don't support it?


Good point. No, currently it's only conditioned on transparent hugepage support.
We need to condition it on devmap support as well.
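
E.g. something along these lines (sketch only; the exact placement in
ttm_bo_vm.c would need more care):

	static vm_fault_t ttm_bo_vm_insert_huge_sketch(struct vm_fault *vmf,
						       pfn_t pfnt,
						       pgprot_t pgprot,
						       bool write)
	{
	#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
	    defined(CONFIG_ARCH_HAS_PTE_DEVMAP)
		return vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
	#else
		/* No huge devmap-marked entries possible in this config. */
		return VM_FAULT_FALLBACK;
	#endif
	}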



Also, I feel like this code to install "pte_special" huge pages does
not belong in the drm subsystem..


I could add helpers in huge_memory.c:

vmf_insert_pfn_pmd_prot_special() and
vmf_insert_pfn_pud_prot_special()
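
i.e. roughly the following, mirroring the existing vmf_insert_pfn_pmd_prot() /
vmf_insert_pfn_pud_prot() (hypothetical signatures, these helpers don't exist
yet):

	/* Like vmf_insert_pfn_pmd_prot(), but mark the entry special
	 * (no backing struct page) instead of devmap.
	 */
	vm_fault_t vmf_insert_pfn_pmd_prot_special(struct vm_fault *vmf,
						   pfn_t pfn, pgprot_t pgprot,
						   bool write);
	vm_fault_t vmf_insert_pfn_pud_prot_special(struct vm_fault *vmf,
						   pfn_t pfn, pgprot_t pgprot,
						   bool write);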

/Thomas



Jason


Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-23 Thread Intel



On 3/23/21 8:52 PM, Williams, Dan J wrote:

On Sun, 2021-03-21 at 19:45 +0100, Thomas Hellström (Intel) wrote:

TTM sets up huge page-table-entries both to system- and device
memory,
and we don't want gup to assume there are always valid backing struct
pages for these. For PTEs this is handled by setting the pte_special
bit,
but for the huge PUDs and PMDs, we have neither pmd_special nor
pud_special. Normally, huge TTM entries are identified by looking at
vma_is_special_huge(), but fast gup can't do that, so as an
alternative
define _devmap entries for which there are no backing dev_pagemap as
special, update documentation and make huge TTM entries _devmap,
after
verifying that there is no backing dev_pagemap.

Please do not abuse p{m,u}d_devmap like this. I'm in the process of
removing get_devpagemap() from the gup-fast path [1]. Instead there
should be space for p{m,u}d_special in the page table entries (at least
for x86-64). So the fix is to remove that old assumption that huge
pages can never be special.

[1]:
http://lore.kernel.org/r/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com


Hmm, yes with that patch it will obviously not work as intended.

Given that, I think we'll need to disable the TTM huge pages for now 
until we can sort out and agree on using a page table entry bit.


Thanks,

/Thomas




[RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas

2021-03-21 Thread Intel
To block fast gup we need to make sure TTM ptes are always special.
With MIXEDMAP we, on architectures that don't support pte_special,
insert normal ptes, but OTOH on those architectures, fast gup is not
supported.
At the same time, the function documentation to vm_normal_page() suggests
that ptes pointing to system memory pages of MIXEDMAP vmas are always
normal, but that doesn't seem consistent with what's implemented in
vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
needed.

But to make sure, and to also avoid normal (non-fast) gup, make all
TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
anymore, so make is_cow_mapping() available and use it to reject
COW mappings at mmap time.

There was previously a comment in the code that WC mappings together
with x86 PAT + PFNMAP was bad for performance. However from looking at
vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
are handled the same for architectures that support pte_special. This
means there should not be a performance difference anymore, but this
needs to be verified.

Cc: Christian Koenig 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: dri-de...@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) 
---
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 --
 include/linux/mm.h  |  5 +
 mm/internal.h   |  5 -
 3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 1c34983480e5..708c6fb9be81 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 * at arbitrary times while the data is mmap'ed.
 * See vmf_insert_mixed_prot() for a discussion.
 */
-   if (vma->vm_flags & VM_MIXEDMAP)
-   ret = vmf_insert_mixed_prot(vma, address,
-   __pfn_to_pfn_t(pfn, 
PFN_DEV),
-   prot);
-   else
-   ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
+   ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
 
/* Never error on prefaulted PTEs */
if (unlikely((ret & VM_FAULT_ERROR))) {
@@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct 
ttm_buffer_object *bo, struct vm_area_s
 * Note: We're transferring the bo reference to
 * vma->vm_private_data here.
 */
-
vma->vm_private_data = bo;
 
/*
-* We'd like to use VM_PFNMAP on shared mappings, where
-* (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
-* but for some reason VM_PFNMAP + x86 PAT + write-combine is very
-* bad for performance. Until that has been sorted out, use
-* VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
+* PFNMAP forces us to block COW mappings in mmap(),
+* and with MIXEDMAP we would incorrectly allow fast gup
+* on TTM memory on architectures that don't have pte_special.
 */
-   vma->vm_flags |= VM_MIXEDMAP;
-   vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+   vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
 }
 
 int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
@@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct vm_area_struct 
*vma,
if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
return -EINVAL;
 
+   if (unlikely(is_cow_mapping(vma->vm_flags)))
+   return -EINVAL;
+
bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
if (unlikely(!bo))
return -EINVAL;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..c6ebf7f9ddbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct vm_area_struct 
*vma)
return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool is_cow_mapping(vm_flags_t flags)
+{
+   return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/internal.h b/mm/internal.h
index 9902648f2206..1432feec62df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct page *page)
  */
 #define buddy_order_unsafe(page)   READ_ONCE(page_private(page))
 
-static inline bool is_cow_mapping(vm_flags_t flags)
-{
-   return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-}
-
 /*
  * These three helpers classifies VMAs for virtual memory accounting.
  */
-- 
2.30.2



[RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-21 Thread Intel
TTM sets up huge page-table-entries both to system- and device memory,
and we don't want gup to assume there are always valid backing struct
pages for these. For PTEs this is handled by setting the pte_special bit,
but for the huge PUDs and PMDs, we have neither pmd_special nor
pud_special. Normally, huge TTM entries are identified by looking at
vma_is_special_huge(), but fast gup can't do that, so as an alternative
define _devmap entries for which there are no backing dev_pagemap as
special, update documentation and make huge TTM entries _devmap, after
verifying that there is no backing dev_pagemap.

One other alternative would be to block TTM huge page-table-entries
completely, and while currently only vmwgfx uses them, they would be
beneficial to other graphics drivers moving forward as well.

Cc: Christian Koenig 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: dri-de...@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) 
---
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 17 -
 mm/gup.c| 21 +++--
 mm/memremap.c   |  5 +
 3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 6dc96cf66744..1c34983480e5 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
pfn_t pfnt;
struct ttm_tt *ttm = bo->ttm;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+   struct dev_pagemap *pagemap;
 
/* Fault should not cross bo boundary. */
page_offset &= ~(fault_page_size - 1);
@@ -210,6 +211,20 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
if ((pfn & (fault_page_size - 1)) != 0)
goto out_fallback;
 
+   /*
+* Huge entries must be special, that is marking them as devmap
+* with no backing device map range. If there is a backing
+* range, Don't insert a huge entry.
+* If this check turns out to be too much of a performance hit,
+* we can instead have drivers indicate whether they may have
+* backing device map ranges and if not, skip this lookup.
+*/
+   pagemap = get_dev_pagemap(pfn, NULL);
+   if (pagemap) {
+   put_dev_pagemap(pagemap);
+   goto out_fallback;
+   }
+
/* Check that memory is contiguous. */
if (!bo->mem.bus.is_iomem) {
for (i = 1; i < fault_page_size; ++i) {
@@ -223,7 +238,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault 
*vmf,
}
}
 
-   pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
+   pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..1b6a127f0bdd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1993,6 +1993,17 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int 
nr_start,
 }
 
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB as it is guaranteed not to be
+ * special. For THP, special huge entries are indicated by xxx_devmap()
+ * returning true, but a corresponding call to get_dev_pagemap() will
+ * return NULL.
+ *
+ * For a futex to be placed on a THP tail page, get_futex_key requires a
+ * get_user_pages_fast_only implementation that can pin pages. Thus it's still
+ * useful to have gup_huge_pmd even if we can't operate on ptes.
+ */
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
 {
@@ -2069,16 +2080,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
return ret;
 }
 #else
-
-/*
- * If we can't determine whether or not a pte is special, then fail immediately
- * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
- * to be special.
- *
- * For a futex to be placed on a THP tail page, get_futex_key requires a
- * get_user_pages_fast_only implementation that can pin pages. Thus it's still
- * useful to have gup_huge_pmd even if we can't operate on ptes.
- */
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
 {
diff --git a/mm/memremap.c b/mm/memremap.c
index 7aa7d6e80ee5..757551cd2a4d 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -471,6 +471,11 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned 
long nr_pfns)
 

[RFC PATCH 0/2] mm,drm/ttm: Always block GUP to TTM pages

2021-03-21 Thread Intel
get_user_pages() to TTM pages is unwanted since TTM assumes it owns
the pages exclusively and / or sets up page-table mappings to io memory.

The first patch makes sure we stop fast gup to huge TTM pages using
a trick with pmd_devmap() and pud_devmap() without a backing
dev_pagemap.

The second patch makes sure we block normal gup by setting VM_PFNMAP

Cc: Christian Koenig 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: dri-de...@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) 


Re: [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas

2021-03-22 Thread Intel

Hi!

On 3/22/21 8:47 AM, Christian König wrote:

Am 21.03.21 um 19:45 schrieb Thomas Hellström (Intel):

To block fast gup we need to make sure TTM ptes are always special.
With MIXEDMAP we, on architectures that don't support pte_special,
insert normal ptes, but OTOH on those architectures, fast is not
supported.
At the same time, the function documentation to vm_normal_page() 
suggests

that ptes pointing to system memory pages of MIXEDMAP vmas are always
normal, but that doesn't seem consistent with what's implemented in
vmf_insert_mixed(). I'm thus not entirely sure this patch is actually
needed.

But to make sure and to avoid also normal (non-fast) gup, make all
TTM vmas PFNMAP. With PFNMAP we can't allow COW mappings
anymore so make is_cow_mapping() available and use it to reject
COW mappigs at mmap time.


I would separate the disallowing of COW mapping from the PFN change. 
I'm pretty sure that COW mappings never worked on TTM BOs in the first 
place.


COW doesn't work with PFNMAP together with non-linear maps, so as a 
consequence of moving from MIXEDMAP to PFNMAP we must disallow COW. It 
therefore seems logical to me to do it in one patch.


And working COW was one of the tests I used for huge PMDs/PUDs, so it 
has indeed been working, but I can't think of any relevant use-cases.


Did you, BTW, have a chance to test this with WC mappings?

Thanks,
/Thomas





But either way this patch is Reviewed-by: Christian König 
.


Thanks,
Christian.



There was previously a comment in the code that WC mappings together
with x86 PAT + PFNMAP was bad for performance. However from looking at
vmf_insert_mixed() it looks like in the current code PFNMAP and MIXEDMAP
are handled the same for architectures that support pte_special. This
means there should not be a performance difference anymore, but this
needs to be verified.

Cc: Christian Koenig 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: dri-de...@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Thomas Hellström (Intel) 
---
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 22 --
  include/linux/mm.h  |  5 +
  mm/internal.h   |  5 -
  3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
b/drivers/gpu/drm/ttm/ttm_bo_vm.c

index 1c34983480e5..708c6fb9be81 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -372,12 +372,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct 
vm_fault *vmf,

   * at arbitrary times while the data is mmap'ed.
   * See vmf_insert_mixed_prot() for a discussion.
   */
-    if (vma->vm_flags & VM_MIXEDMAP)
-    ret = vmf_insert_mixed_prot(vma, address,
-    __pfn_to_pfn_t(pfn, PFN_DEV),
-    prot);
-    else
-    ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
+    ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
    /* Never error on prefaulted PTEs */
  if (unlikely((ret & VM_FAULT_ERROR))) {
@@ -555,18 +550,14 @@ static void ttm_bo_mmap_vma_setup(struct 
ttm_buffer_object *bo, struct vm_area_s

   * Note: We're transferring the bo reference to
   * vma->vm_private_data here.
   */
-
  vma->vm_private_data = bo;
    /*
- * We'd like to use VM_PFNMAP on shared mappings, where
- * (vma->vm_flags & VM_SHARED) != 0, for performance reasons,
- * but for some reason VM_PFNMAP + x86 PAT + write-combine is very
- * bad for performance. Until that has been sorted out, use
- * VM_MIXEDMAP on all mappings. See freedesktop.org bug #75719
+ * PFNMAP forces us to block COW mappings in mmap(),
+ * and with MIXEDMAP we would incorrectly allow fast gup
+ * on TTM memory on architectures that don't have pte_special.
   */
-    vma->vm_flags |= VM_MIXEDMAP;
-    vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
+    vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
  }
    int ttm_bo_mmap(struct file *filp, struct vm_area_struct *vma,
@@ -579,6 +570,9 @@ int ttm_bo_mmap(struct file *filp, struct 
vm_area_struct *vma,

  if (unlikely(vma->vm_pgoff < DRM_FILE_PAGE_OFFSET_START))
  return -EINVAL;
  +    if (unlikely(is_cow_mapping(vma->vm_flags)))
+    return -EINVAL;
+
  bo = ttm_bo_vm_lookup(bdev, vma->vm_pgoff, vma_pages(vma));
  if (unlikely(!bo))
  return -EINVAL;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..c6ebf7f9ddbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -686,6 +686,11 @@ static inline bool vma_is_accessible(struct 
vm_area_struct *vma)

  return vma->vm_flags & VM_ACCESS_FLAGS;
  }
  +static inline bool is_cow_mapping(vm_flags_t flags)
+{
+    return (flags 

Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-26 Thread Intel



On 3/25/21 7:24 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:

On 3/25/21 6:55 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:

On 3/24/21 9:25 PM, Dave Hansen wrote:

On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:

We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
used.  It's quite possible we can encode another use even in the
existing bits.

Personally, I'd just try:

#define _PAGE_BIT_SOFTW5    57  /* available for programmer */


OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
used in a selftest, but only for PTEs AFAICT.

Oh, and we don't care about 32-bit much anymore?

On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
enabled.  IOW, we can handle the majority of 32-bit CPUs out there.

But, yeah, we don't care about 32-bit. :)

Hmm,

Actually it makes some sense to use SW1, to make it end up in the same dword
as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
PAE is not atomic, so in theory a huge pmd could be modified while reading
the pmd_t, making the dwords inconsistent. How does that work with fast
gup anyway?

It loops to get an atomic 64 bit value if the arch can't provide an
atomic 64 bit load

Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
before the pmd is compared to the value the pointer is currently pointing
to. Couldn't those dereferences be on invalid pointers?

Uh.. That does look questionable, yes. Unless there is some tricky
reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
has a stable upper 32 bits..

The pte does it with ptep_get_lockless(), we probably need the same
for the other levels too instead of open coding a READ_ONCE?

Jason


TBH, ptep_get_lockless() also looks a bit fishy. It says
"it will not switch to a completely different present page without a TLB 
flush in between".


What if the following happens:

processor 1: Reads lower dword of PTE.
processor 2: Zaps PTE. Gets stuck waiting to do TLB flush
processor 1: Reads upper dword of PTE, which is now zero.
processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a 
new PTE value which happens to be the same as the original one before 
the zap.
processor 1: Reads the newly faulted in lower dword, compares to the old 
one, gives an OK and returns a bogus PTE.
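
For reference, the split read this scenario targets looks roughly like
the sketch below (modeled on the 32-bit PAE ptep_get_lockless(); a
simplified sketch, not the actual implementation):

static pte_t ptep_get_lockless_sketch(pte_t *ptep)
{
        pte_t pte;

        /* Re-read the low dword until it is stable across the high read. */
        do {
                pte.pte_low = READ_ONCE(ptep->pte_low);
                smp_rmb();
                pte.pte_high = READ_ONCE(ptep->pte_high);
                smp_rmb();
        } while (unlikely(pte.pte_low != READ_ONCE(ptep->pte_low)));

        return pte;
}

In the sequence above the final low-dword comparison would pass even
though the high dword was sampled while the entry was zapped, which is
exactly the inconsistency I'm worried about.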


/Thomas




Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-26 Thread Intel



On 3/26/21 12:46 PM, Jason Gunthorpe wrote:

On Fri, Mar 26, 2021 at 10:08:09AM +0100, Thomas Hellström (Intel) wrote:

On 3/25/21 7:24 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:

On 3/25/21 6:55 PM, Jason Gunthorpe wrote:

On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:

On 3/24/21 9:25 PM, Dave Hansen wrote:

On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:

We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
used.  It's quite possible we can encode another use even in the
existing bits.

Personally, I'd just try:

#define _PAGE_BIT_SOFTW5    57  /* available for programmer */


OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
used in a selftest, but only for PTEs AFAICT.

Oh, and we don't care about 32-bit much anymore?

On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
enabled.  IOW, we can handle the majority of 32-bit CPUs out there.

But, yeah, we don't care about 32-bit. :)

Hmm,

Actually it makes some sense to use SW1, to make it end up in the same dword
as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
PAE is not atomic, so in theory a huge pmd could be modified while reading
the pmd_t making the dwords inconsistent How does that work with fast
gup anyway?

It loops to get an atomic 64 bit value if the arch can't provide an
atomic 64 bit load

Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
before the pmd is compared to the value the pointer is currently pointing
to. Couldn't those dereferences be on invalid pointers?

Uh.. That does look questionable, yes. Unless there is some tricky
reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
has a stable upper 32 bits..

The pte does it with ptep_get_lockless(), we probably need the same
for the other levels too instead of open coding a READ_ONCE?

Jason

TBH, ptep_get_lockless() also looks a bit fishy. it says
"it will not switch to a completely different present page without a TLB
flush in between".

What if the following happens:

processor 1: Reads lower dword of PTE.
processor 2: Zaps PTE. Gets stuck waiting to do TLB flush
processor 1: Reads upper dword of PTE, which is now zero.
processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a new
PTE value which happens to be the same as the original one before the zap.
processor 1: Reads the newly faulted in lower dword, compares to the old
one, gives an OK and returns a bogus PTE.

So you are saying that while the zap will wait for the TLB flush to
globally finish once it gets started any other processor can still
write to the pte?

I can't think of any serialization that would cause fault to wait for
the zap/TLB flush, especially if the zap comes from the address_space
and doesn't hold the mmap lock.


I might of course be completely wrong, but it seems there is an 
assumption made that all potentially affected processors would have a 
valid TLB entry for the PTE. Then the fault would not happen (well 
unless of course the TLB flush completes on some processors before 
getting stuck on the local_irq_disable() on processor 1).


+CC: Nick Piggin

Seems like Nick Piggin is the original author of the comment. Perhaps he 
can clarify a bit.


/Thomas




Seems worth bringing up in a bigger thread, maybe someone else knows?

Jason


Re: [PATCH 01/18] mm: Track mmu notifiers in fs_reclaim_acquire/release

2020-06-10 Thread Intel

Hi, Daniel,

Please see below.

On 6/4/20 10:12 AM, Daniel Vetter wrote:

fs_reclaim_acquire/release nicely catch recursion issues when
allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend
to use to keep the excessive caches in check). For mmu notifier
recursions we do have lockdep annotations since 23b68395c7c7
("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").

But these only fire if a path actually results in some pte
invalidation - for most small allocations that's very rarely the case.
The other trouble is that pte invalidation can happen any time when
__GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe
choice, GFP_NOIO isn't good enough to avoid potential mmu notifier
recursion.

I was pondering whether we should just do the general annotation, but
there's always the risk for false positives. Plus I'm assuming that
the core fs and io code is a lot better reviewed and tested than
random mmu notifier code in drivers. Hence why I decide to only
annotate for that specific case.

Furthermore even if we'd create a lockdep map for direct reclaim, we'd
still need to explicit pull in the mmu notifier map - there's a lot
more places that do pte invalidation than just direct reclaim, these
two contexts arent the same.

Note that the mmu notifiers needing their own independent lockdep map
is also the reason we can't hold them from fs_reclaim_acquire to
fs_reclaim_release - it would nest with the acquistion in the pte
invalidation code, causing a lockdep splat. And we can't remove the
annotations from pte invalidation and all the other places since
they're called from many other places than page reclaim. Hence we can
only do the equivalent of might_lock, but on the raw lockdep map.

With this we can also remove the lockdep priming added in 66204f1d2d1b
("mm/mmu_notifiers: prime lockdep") since the new annotations are
strictly more powerful.

Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: linux-r...@vger.kernel.org
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
This is part of a gpu lockdep annotation series simply because it
really helps to catch issues where gpu subsystem locks and primitives
can deadlock with themselves through allocations and mmu notifiers.
But aside from that motivation it should be completely free-standing,
and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
---
  mm/mmu_notifier.c |  7 ---
  mm/page_alloc.c   | 23 ++-
  2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..5d578b9122f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier 
*subscription,
lockdep_assert_held_write(&mm->mmap_sem);
BUG_ON(atomic_read(&mm->mm_users) <= 0);
  
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {

-   fs_reclaim_acquire(GFP_KERNEL);
-   lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-   lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-   fs_reclaim_release(GFP_KERNEL);
-   }
-
if (!mm->notifier_subscriptions) {
/*
 * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..f8a222db4a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned 
int order, int alloc_fla
  static struct lockdep_map __fs_reclaim_map =
STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
  
-static bool __need_fs_reclaim(gfp_t gfp_mask)

+static bool __need_reclaim(gfp_t gfp_mask)
  {
gfp_mask = current_gfp_context(gfp_mask);
  
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)

if (current->flags & PF_MEMALLOC)
return false;
  
-	/* We're only interested __GFP_FS allocations for now */

-   if (!(gfp_mask & __GFP_FS))
-   return false;
-
if (gfp_mask & __GFP_NOLOCKDEP)
return false;
  
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
  
  void fs_reclaim_acquire(gfp_t gfp_mask)

  {
-   if (__need_fs_reclaim(gfp_mask))
-   __fs_reclaim_acquire();
+   if (__need_reclaim(gfp_mask)) {
+   if (!(gfp_mask & __GFP_FS))

Hmm. Shouldn't this be "if (gfp_mask & __GFP_FS)" or am I misunderstanding?
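
I.e. I would have expected the whole block to read something like this
(only a sketch, reusing the names from the hunk above):

	if (__need_reclaim(gfp_mask)) {
		/* Only __GFP_FS allocations map to the fs_reclaim lockdep map. */
		if (gfp_mask & __GFP_FS)
			__fs_reclaim_acquire();

		/* The mmu notifier map applies to any reclaim-capable allocation. */
		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
	}

so that the fs_reclaim annotation is only taken for __GFP_FS allocations,
while the mmu notifier annotation covers anything that may enter reclaim.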

+   __fs_reclaim_acquire();



#ifdef CONFIG_MMU_NOTIFIER?


+
+   lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+   lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+
+   }
  }
  EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
  
  void fs_reclaim_release(gfp_t gfp_mask)

  {
-  

Re: [PATCH 02/18] dma-buf: minor doc touch-ups

2020-06-10 Thread Intel



On 6/4/20 10:12 AM, Daniel Vetter wrote:

Just some tiny edits:
- fix link to struct dma_fence
- give slightly more meaningful title - the polling here is about
   implicit fences, explicit fences (in sync_file or drm_syncobj) also
   have their own polling

Signed-off-by: Daniel Vetter 


Reviewed-by: Thomas Hellstrom 




Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-11 Thread Intel



On 6/4/20 10:12 AM, Daniel Vetter wrote:

Two in one go:
- it is allowed to call dma_fence_wait() while holding a
   dma_resv_lock(). This is fundamental to how eviction works with ttm,
   so required.

- it is allowed to call dma_fence_wait() from memory reclaim contexts,
   specifically from shrinker callbacks (which i915 does), and from mmu
   notifier callbacks (which amdgpu does, and which i915 sometimes also
   does, and probably always should, but that's kinda a debate). Also
   for stuff like HMM we really need to be able to do this, or things
   get real dicey.

Consequence is that any critical path necessary to get to a
dma_fence_signal for a fence must never a) call dma_resv_lock nor b)
allocate memory with GFP_KERNEL. Also by implication of
dma_resv_lock(), no userspace faulting allowed. That's some supremely
obnoxious limitations, which is why we need to sprinkle the right
annotations to all relevant paths.

The one big locking context we're leaving out here is mmu notifiers,
added in

commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter 
Date:   Mon Aug 26 22:14:21 2019 +0200

 mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end

that one covers a lot of other callsites, and it's also allowed to
wait on dma-fences from mmu notifiers. But there's no ready-made
functions exposed to prime this, so I've left it out for now.

v2: Also track against mmu notifier context.

v3: kerneldoc to spec the cross-driver contract. Note that currently
i915 throws in a hard-coded 10s timeout on foreign fences (not sure
why that was done, but it's there), which is why that rule is worded
with SHOULD instead of MUST.

Also some of the mmu_notifier/shrinker rules might surprise SoC
drivers, I haven't fully audited them all. Which is infeasible anyway,
we'll need to run them with lockdep and dma-fence annotations and see
what goes boom.

v4: A spelling fix from Mika

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
  Documentation/driver-api/dma-buf.rst |  6 
  drivers/dma-buf/dma-fence.c  | 41 
  drivers/dma-buf/dma-resv.c   |  4 +++
  include/linux/dma-fence.h|  1 +
  4 files changed, 52 insertions(+)


I still have my doubts about allowing fence waiting from within 
shrinkers. IMO ideally they should use a trywait approach, in order to 
allow memory allocation during command submission for drivers that
publish fences before command submission (since early reservation 
object release requires that).
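
For illustration, a trywait-style check in a shrinker could look roughly
like the sketch below (the bo struct and its fields are made up; the
dma_resv and dma_fence calls are the existing ones):

static bool bo_is_reclaimable(struct my_bo *bo)
{
        struct dma_fence *fence = bo->last_fence;       /* made-up field */
        bool idle;

        /* Never block in reclaim; just skip busy objects. */
        if (!dma_resv_trylock(bo->resv))                /* made-up field */
                return false;

        idle = !fence || dma_fence_is_signaled(fence);

        dma_resv_unlock(bo->resv);
        return idle;
}

I.e. reclaim only what is already idle and skip the rest, rather than
doing a blocking dma_fence_wait() from the shrinker.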


But since drivers are already waiting from within shrinkers and I take 
your word for HMM requiring this,


Reviewed-by: Thomas Hellström 




Re: [RFC 02/17] dma-fence: basic lockdep annotations

2020-05-28 Thread Intel

On 2020-05-12 10:59, Daniel Vetter wrote:

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
   this explicit annotation can be more liberally sprinkled around.
   With read locks lockdep isn't going to complain if the read-side
   isn't nested the same way under all circumstances, so ABBA deadlocks
   are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
   read lock mode lockdep does not catch read side hazards. And we
   _very_ much want read side hazards to be caught. For full details of
   this limitation see

   commit e91498589746065e3ae95d9a00b068e525eec34f
   Author: Peter Zijlstra 
   Date:   Wed Aug 23 13:13:11 2017 +0200

   locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
   keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
   dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
   to call dma_fence_begin/end_signalling from soft/hardirq context.
   First attempt was using the hardirq locking context for the write
   side in lockdep, but this forces all normal spinlocks nested within
   dma_fence_begin/end_signalling to be spinlocks. That bollocks.

   The approach now is to simple check in_atomic(), and for these cases
   entirely rely on the might_sleep() check in dma_fence_wait(). That
   will catch any wrong nesting against spinlocks from soft/hardirq
   contexts.

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch are:

Thread A:

mutex_lock(A);
mutex_unlock(A);

dma_fence_signal();

Thread B:

mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindless to avoid false
positives.

v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 


LGTM. Perhaps add some in-code documentation on how the new functions 
are meant to be called.
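
For example, something along these lines (only a sketch using the
cookie-based API from this patch; the job/worker names are made up):

/* Completion worker: critical for signalling, hence annotated. */
static void my_job_complete_work(struct work_struct *work)
{
        struct my_job *job = container_of(work, struct my_job, complete_work);
        bool cookie;

        /*
         * From here until end_signalling: no GFP_KERNEL allocations,
         * no dma_resv_lock(), no userspace faults.
         */
        cookie = dma_fence_begin_signalling();

        dma_fence_signal(job->done_fence);

        dma_fence_end_signalling(cookie);
}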


Otherwise for patch 2 and 3,

Reviewed-by: Thomas Hellstrom 




Re: [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-04 Thread Intel



On 6/4/20 10:12 AM, Daniel Vetter wrote:
...

Thread A:

mutex_lock(A);
mutex_unlock(A);

dma_fence_signal();

Thread B:

mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindless to avoid false
positives.


Just realized, isn't that example actually a true positive, or at least 
a great candidate for a true positive, since if another thread reenters 
that signaling path, it will block on that mutex, and the fence would 
never be signaled unless there is another signaling path?


Although I agree the conclusion is sound: These annotations cannot be 
sprinkled mindlessly over the code.


/Thomas








v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

v3: Kerneldoc.

v4: Some spelling fixes from Mika

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
  Documentation/driver-api/dma-buf.rst |  12 +-
  drivers/dma-buf/dma-fence.c  | 161 +++
  include/linux/dma-fence.h|  12 ++
  3 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst 
b/Documentation/driver-api/dma-buf.rst
index 63dec76d1d8d..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
  .. kernel-doc:: drivers/dma-buf/dma-buf.c
 :doc: cpu access
  
-Fence Poll Support

-~~
+Implicit Fence Poll Support
+~~~
  
  .. kernel-doc:: drivers/dma-buf/dma-buf.c

-   :doc: fence polling
+   :doc: implicit fence polling
  
  Kernel Functions and Structures Reference

  ~
@@ -133,6 +133,12 @@ DMA Fences
  .. kernel-doc:: drivers/dma-buf/dma-fence.c
 :doc: DMA fences overview
  
+DMA Fence Signalling Annotations

+
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
  DMA Fences Functions Reference
  ~~
  
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c

index 656e9ac2d028..0005bc002529 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
  }
  EXPORT_SYMBOL(dma_fence_context_alloc);
  
+/**

+ * DOC: fence signalling annotation
+ *
+ * Proving correctness of all the kernel code around &dma_fence through code
+ * review and testing is tricky for a few reasons:
+ *
+ * * It is a cross-driver contract, and therefore all drivers must follow the
+ *   same rules for lock nesting order, calling contexts for various functions
+ *   and anything else significant for in-kernel interfaces. But it is also
+ *   impossible to test all drivers in a single machine, hence brute-force N 
vs.
+ *   N testing of all combinations is impossible. Even just limiting to the
+ *   possible combinations is infeasible.
+ *
+ * * There is an enormous amount of driver code involved. For render drivers
+ *   there's the tail of command submission, after fences are published,
+ *   scheduler code, interrupt and workers to process job completion,
+ *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
+ *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
+ *   and &shrinker. For modesetting drivers there's the commit tail functions
+ *   between when fences for an atomic modeset are published, and when the
+ *   corresponding vblank completes, including any interrupt processing and
+ *   related workers. Auditing all that code, across all drivers, is not
+ *   feasible.
+ *
+ * * Due to how many other subsystems are involved and the locking hierarchies
+ *   this pulls in there is extremely thin wiggle-room for driver-specific
+ *   differences. &dma_fence interacts with almost all of the core memory
+ *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
+ *   dma_resv_unlock(). On the other side it

Re: [PATCH] dma-fence: basic lockdep annotations

2020-06-05 Thread Intel
does a
   dma_fence_signal, but is still critical for reaching completion of
   fences. One example would be a scheduler kthread which picks up jobs
   and pushes them into hardware, where the interrupt handler or
   another completion thread calls dma_fence_signal(). But if the
   scheduler thread hangs, then all the fences hang, hence we need to
   manually annotate it. cross-release aimed to solve this by chaining
   cross-release dependencies, but the dependency from scheduler thread
   to the completion interrupt handler goes through hw where
   cross-release code can't observe it.

In short, without manual annotations and careful review of the start
and end of critical sections, cross-release dependency tracking doesn't
work. We need explicit annotations.

v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

v3: Kerneldoc.

v4: Some spelling fixes from Mika

v5: Amend commit message to explain in detail why cross-release isn't
the solution.

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---


Reviewed-by: Thomas Hellström 




[PATCH v4 2/2] selftests/x86: Add a test to detect infinite sigtrap handler loop

2025-06-05 Thread Xin Li (Intel)
When FRED is enabled, if the Trap Flag (TF) is set without an external
debugger attached, it can lead to an infinite loop in the SIGTRAP
handler.  To avoid this, the software event flag in the augmented SS
must be cleared, ensuring that no single-step trap remains pending when
ERETU completes.

This test checks for that specific scenario—verifying whether the kernel
correctly prevents an infinite SIGTRAP loop in this edge case when FRED
is enabled.

The test should _always_ pass with IDT event delivery, thus no need to
disable the test even when FRED is not enabled.

Signed-off-by: Xin Li (Intel) 
---

Changes in this version:
*) Address review comments from Sohil.
---
 tools/testing/selftests/x86/Makefile   |  2 +-
 tools/testing/selftests/x86/sigtrap_loop.c | 97 ++
 2 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c

diff --git a/tools/testing/selftests/x86/Makefile 
b/tools/testing/selftests/x86/Makefile
index f703fcfe9f7c..83148875a12c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" 
trivial_program.c -no-pie)
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt 
test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
-   test_vsyscall mov_ss_trap \
+   test_vsyscall mov_ss_trap sigtrap_loop \
syscall_arg_fault fsgsbase_restore sigaltstack
 TARGETS_C_BOTHBITS += nx_stack
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
diff --git a/tools/testing/selftests/x86/sigtrap_loop.c 
b/tools/testing/selftests/x86/sigtrap_loop.c
new file mode 100644
index ..dfb05769551d
--- /dev/null
+++ b/tools/testing/selftests/x86/sigtrap_loop.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Intel Corporation
+ */
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __x86_64__
+# define REG_IP REG_RIP
+#else
+# define REG_IP REG_EIP
+#endif
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int 
flags)
+{
+   struct sigaction sa;
+
+   memset(&sa, 0, sizeof(sa));
+   sa.sa_sigaction = handler;
+   sa.sa_flags = SA_SIGINFO | flags;
+   sigemptyset(&sa.sa_mask);
+
+   if (sigaction(sig, &sa, 0))
+   err(1, "sigaction");
+
+   return;
+}
+
+static unsigned int loop_count_on_same_ip;
+
+static void sigtrap(int sig, siginfo_t *info, void *ctx_void)
+{
+   ucontext_t *ctx = (ucontext_t *)ctx_void;
+   static unsigned long last_trap_ip;
+
+   if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) {
+   printf("\tTrapped at %016lx\n", last_trap_ip);
+
+   /*
+* If the same IP is hit more than 10 times in a row, it is
+* _considered_ an infinite loop.
+*/
+   if (++loop_count_on_same_ip > 10) {
+   printf("[FAIL]\tDetected sigtrap infinite loop\n");
+   exit(1);
+   }
+
+   return;
+   }
+
+   loop_count_on_same_ip = 0;
+   last_trap_ip = ctx->uc_mcontext.gregs[REG_IP];
+   printf("\tTrapped at %016lx\n", last_trap_ip);
+}
+
+int main(int argc, char *argv[])
+{
+   sethandler(SIGTRAP, sigtrap, 0);
+
+   /*
+* Set the Trap Flag (TF) to single-step the test code, therefore to
+* trigger a SIGTRAP signal after each instruction until the TF is
+* cleared.
+*
+* Because the arithmetic flags are not significant here, the TF is
+* set by pushing 0x302 onto the stack and then popping it into the
+* flags register.
+*
+* Four instructions in the following asm code are executed with the
+* TF set, thus the SIGTRAP handler is expected to run four times.
+*/
+   printf("[RUN]\tsigtrap infinite loop detection\n");
+   asm volatile(
+#ifdef __x86_64__
+   /* Avoid clobbering the redzone */
+   "sub $128, %rsp\n\t"
+#endif
+   "push $0x302\n\t"
+   "popf\n\t"
+   "nop\n\t"
+   "nop\n\t"
+   "push $0x202\n\t"
+   "popf\n\t"
+#ifdef __x86_64__
+   "add $128, %rsp\n\t"
+#endif
+   );
+
+   printf("[OK]\tNo sigtrap infinite loop detected\n");
+   return 0;
+}
-- 
2.49.0




[PATCH v4 1/2] x86/fred/signal: Prevent single-step upon ERETU completion

2025-06-05 Thread Xin Li (Intel)
Clear the software event flag in the augmented SS to prevent infinite
SIGTRAP handler loop if the trap flag (TF) is set without an external
debugger attached.

Following is a typical single-stepping flow for a user process:

1) The user process is prepared for single-stepping by setting
   RFLAGS.TF = 1.
2) When any instruction in user space completes, a #DB is triggered.
3) The kernel handles the #DB and returns to user space, invoking the
   SIGTRAP handler with RFLAGS.TF = 0.
4) After the SIGTRAP handler finishes, the user process performs a
   sigreturn syscall, restoring the original state, including
   RFLAGS.TF = 1.
5) Goto step 2.

According to the FRED specification:

A) Bit 17 in the augmented SS is designated as the software event
   flag, which is set to 1 for FRED event delivery of SYSCALL,
   SYSENTER, or INT n.
B) If bit 17 of the augmented SS is 1 and ERETU would result in
   RFLAGS.TF = 1, a single-step trap will be pending upon completion
   of ERETU.

In step 4) above, the software event flag is set upon the sigreturn
syscall, and its corresponding ERETU would restore RFLAGS.TF = 1.
This combination causes a pending single-step trap upon completion of
ERETU.  Therefore, another #DB is triggered before any user space
instruction is executed, which leads to an infinite loop in which the
SIGTRAP handler keeps being invoked on the same user space IP.

Suggested-by: H. Peter Anvin (Intel) 
Signed-off-by: Xin Li (Intel) 
Cc: sta...@vger.kernel.org
---

Change in v3:
*) Use "#ifdef CONFIG_X86_FRED" instead of IS_ENABLED(CONFIG_X86_FRED)
   (Intel LKP).

Change in v2:
*) Remove the check cpu_feature_enabled(X86_FEATURE_FRED), because
   regs->fred_ss.swevent will always be 0 otherwise (H. Peter Anvin).
---
 arch/x86/include/asm/sighandling.h | 22 ++
 arch/x86/kernel/signal_32.c|  4 
 arch/x86/kernel/signal_64.c|  4 
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/sighandling.h 
b/arch/x86/include/asm/sighandling.h
index e770c4fc47f4..b8481d33ba8e 100644
--- a/arch/x86/include/asm/sighandling.h
+++ b/arch/x86/include/asm/sighandling.h
@@ -24,4 +24,26 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs 
*regs);
 int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
 int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
 
+/*
+ * To prevent infinite SIGTRAP handler loop if the trap flag (TF) is set
+ * without an external debugger attached, clear the software event flag in
+ * the augmented SS, ensuring no single-step trap is pending upon ERETU
+ * completion.
+ *
+ * Note, this function should be called in sigreturn() before the original
+ * state is restored to make sure the TF is read from the entry frame.
+ */
+static __always_inline void prevent_single_step_upon_eretu(struct pt_regs 
*regs)
+{
+   /*
+* If the trap flag (TF) is set, i.e., the sigreturn() SYSCALL 
instruction
+* is being single-stepped, do not clear the software event flag in the
+* augmented SS, thus a debugger won't skip over the following 
instruction.
+*/
+#ifdef CONFIG_X86_FRED
+   if (!(regs->flags & X86_EFLAGS_TF))
+   regs->fred_ss.swevent = 0;
+#endif
+}
+
 #endif /* _ASM_X86_SIGHANDLING_H */
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 98123ff10506..42bbc42bd350 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -152,6 +152,8 @@ SYSCALL32_DEFINE0(sigreturn)
struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user 
*)(regs->sp-8);
sigset_t set;
 
+   prevent_single_step_upon_eretu(regs);
+
if (!access_ok(frame, sizeof(*frame)))
goto badframe;
if (__get_user(set.sig[0], &frame->sc.oldmask)
@@ -175,6 +177,8 @@ SYSCALL32_DEFINE0(rt_sigreturn)
struct rt_sigframe_ia32 __user *frame;
sigset_t set;
 
+   prevent_single_step_upon_eretu(regs);
+
frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4);
 
if (!access_ok(frame, sizeof(*frame)))
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index ee9453891901..d483b585c6c6 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -250,6 +250,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
sigset_t set;
unsigned long uc_flags;
 
+   prevent_single_step_upon_eretu(regs);
+
frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long));
if (!access_ok(frame, sizeof(*frame)))
goto badframe;
@@ -366,6 +368,8 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
sigset_t set;
unsigned long uc_flags;
 
+   prevent_single_step_upon_eretu(regs);
+
frame = (struct rt_sigframe_x32 __user *)(regs->sp - 8);
 
if (!access_ok(frame, sizeof(*frame)))
-- 
2.49.0




[PATCH v4 0/2] x86/fred: Prevent single-step upon ERETU completion

2025-06-05 Thread Xin Li (Intel)
IDT event delivery has a debug hole in which it does not generate #DB
upon returning to userspace before the first userspace instruction is
executed if the Trap Flag (TF) is set.

FRED closes this hole by introducing a software event flag, i.e., bit
17 of the augmented SS: if the bit is set and ERETU would result in
RFLAGS.TF = 1, a single-step trap will be pending upon completion of
ERETU.

However I overlooked properly setting and clearing the bit in different
situations.  Thus when FRED is enabled, if the Trap Flag (TF) is set
without an external debugger attached, it can lead to an infinite loop
in the SIGTRAP handler.  To avoid this, the software event flag in the
augmented SS must be cleared, ensuring that no single-step trap remains
pending when ERETU completes.

This patch set combines the fix [1] and its corresponding selftest [2]
(requested by Dave Hansen) into one patch set.

[1] https://lore.kernel.org/lkml/20250523050153.3308237-1-...@zytor.com/
[2] https://lore.kernel.org/lkml/20250530230707.2528916-1-...@zytor.com/


This patch set is based on tip/x86/urgent branch as of today.


Xin Li (Intel) (2):
  x86/fred/signal: Prevent single-step upon ERETU completion
  selftests/x86: Add a test to detect infinite sigtrap handler loop

 arch/x86/include/asm/sighandling.h | 22 +
 arch/x86/kernel/signal_32.c|  4 +
 arch/x86/kernel/signal_64.c|  4 +
 tools/testing/selftests/x86/Makefile   |  2 +-
 tools/testing/selftests/x86/sigtrap_loop.c | 97 ++
 5 files changed, 128 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c


base-commit: dd2922dcfaa3296846265e113309e5f7f138839f
-- 
2.49.0




[PATCH v5 0/2] x86/fred: Prevent immediate repeat of single step trap on return from SIGTRAP handler

2025-06-06 Thread Xin Li (Intel)
IDT event delivery has a debug hole in which it does not generate #DB
upon returning to userspace before the first userspace instruction is
executed if the Trap Flag (TF) is set.

FRED closes this hole by introducing a software event flag, i.e., bit
17 of the augmented SS: if the bit is set and ERETU would result in
RFLAGS.TF = 1, a single-step trap will be pending upon completion of
ERETU.

However I overlooked properly setting and clearing the bit in different
situations.  Thus when FRED is enabled, if the Trap Flag (TF) is set
without an external debugger attached, it can lead to an infinite loop
in the SIGTRAP handler.  To avoid this, the software event flag in the
augmented SS must be cleared, ensuring that no single-step trap remains
pending when ERETU completes.

This patch set combines the fix [1] and its corresponding selftest [2]
(requested by Dave Hansen) into one patch set.

[1] https://lore.kernel.org/lkml/20250523050153.3308237-1-...@zytor.com/
[2] https://lore.kernel.org/lkml/20250530230707.2528916-1-...@zytor.com/


This patch set is based on tip/x86/urgent branch as of today.


Link to v4 of this patch set:
https://lore.kernel.org/lkml/20250605181020.590459-1-...@zytor.com/

Changes in v5:
*) Accurately rephrase the shortlog (hpa).
*) Do "sub $-128, %rsp" rather than "add $128, %rsp", which is more
   efficient in code size (hpa).
*) Add TB from Sohil.
*) Add Cc: sta...@vger.kernel.org to all patches.


Xin Li (Intel) (2):
  x86/fred/signal: Prevent immediate repeat of single step trap on
return from SIGTRAP handler
  selftests/x86: Add a test to detect infinite sigtrap handler loop

 arch/x86/include/asm/sighandling.h | 22 +
 arch/x86/kernel/signal_32.c|  4 +
 arch/x86/kernel/signal_64.c|  4 +
 tools/testing/selftests/x86/Makefile   |  2 +-
 tools/testing/selftests/x86/sigtrap_loop.c | 98 ++
 5 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c


base-commit: dd2922dcfaa3296846265e113309e5f7f138839f
-- 
2.49.0




[PATCH v5 1/2] x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler

2025-06-06 Thread Xin Li (Intel)
Clear the software event flag in the augmented SS to prevent immediate
repeat of single step trap on return from SIGTRAP handler if the trap
flag (TF) is set without an external debugger attached.

Following is a typical single-stepping flow for a user process:

1) The user process is prepared for single-stepping by setting
   RFLAGS.TF = 1.
2) When any instruction in user space completes, a #DB is triggered.
3) The kernel handles the #DB and returns to user space, invoking the
   SIGTRAP handler with RFLAGS.TF = 0.
4) After the SIGTRAP handler finishes, the user process performs a
   sigreturn syscall, restoring the original state, including
   RFLAGS.TF = 1.
5) Goto step 2.

According to the FRED specification:

A) Bit 17 in the augmented SS is designated as the software event
   flag, which is set to 1 for FRED event delivery of SYSCALL,
   SYSENTER, or INT n.
B) If bit 17 of the augmented SS is 1 and ERETU would result in
   RFLAGS.TF = 1, a single-step trap will be pending upon completion
   of ERETU.

In step 4) above, the software event flag is set upon the sigreturn
syscall, and its corresponding ERETU would restore RFLAGS.TF = 1.
This combination causes a pending single-step trap upon completion of
ERETU.  Therefore, another #DB is triggered before any user space
instruction is executed, which leads to an infinite loop in which the
SIGTRAP handler keeps being invoked on the same user space IP.

Suggested-by: H. Peter Anvin (Intel) 
Signed-off-by: Xin Li (Intel) 
Cc: sta...@vger.kernel.org
---

Change in v5:
*) Accurately rephrase the shortlog (hpa).

Change in v4:
*) Add a selftest to the patch set (Dave Hansen).

Change in v3:
*) Use "#ifdef CONFIG_X86_FRED" instead of IS_ENABLED(CONFIG_X86_FRED)
   (Intel LKP).

Change in v2:
*) Remove the check cpu_feature_enabled(X86_FEATURE_FRED), because
   regs->fred_ss.swevent will always be 0 otherwise (hpa).
---
 arch/x86/include/asm/sighandling.h | 22 ++
 arch/x86/kernel/signal_32.c|  4 
 arch/x86/kernel/signal_64.c|  4 
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/sighandling.h 
b/arch/x86/include/asm/sighandling.h
index e770c4fc47f4..8727c7e21dd1 100644
--- a/arch/x86/include/asm/sighandling.h
+++ b/arch/x86/include/asm/sighandling.h
@@ -24,4 +24,26 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs 
*regs);
 int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
 int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
 
+/*
+ * To prevent immediate repeat of single step trap on return from SIGTRAP
+ * handler if the trap flag (TF) is set without an external debugger attached,
+ * clear the software event flag in the augmented SS, ensuring no single-step
+ * trap is pending upon ERETU completion.
+ *
+ * Note, this function should be called in sigreturn() before the original
+ * state is restored to make sure the TF is read from the entry frame.
+ */
+static __always_inline void prevent_single_step_upon_eretu(struct pt_regs 
*regs)
+{
+   /*
+* If the trap flag (TF) is set, i.e., the sigreturn() SYSCALL 
instruction
+* is being single-stepped, do not clear the software event flag in the
+* augmented SS, thus a debugger won't skip over the following 
instruction.
+*/
+#ifdef CONFIG_X86_FRED
+   if (!(regs->flags & X86_EFLAGS_TF))
+   regs->fred_ss.swevent = 0;
+#endif
+}
+
 #endif /* _ASM_X86_SIGHANDLING_H */
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 98123ff10506..42bbc42bd350 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -152,6 +152,8 @@ SYSCALL32_DEFINE0(sigreturn)
struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user 
*)(regs->sp-8);
sigset_t set;
 
+   prevent_single_step_upon_eretu(regs);
+
if (!access_ok(frame, sizeof(*frame)))
goto badframe;
if (__get_user(set.sig[0], &frame->sc.oldmask)
@@ -175,6 +177,8 @@ SYSCALL32_DEFINE0(rt_sigreturn)
struct rt_sigframe_ia32 __user *frame;
sigset_t set;
 
+   prevent_single_step_upon_eretu(regs);
+
frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4);
 
if (!access_ok(frame, sizeof(*frame)))
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index ee9453891901..d483b585c6c6 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -250,6 +250,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
sigset_t set;
unsigned long uc_flags;
 
+   prevent_single_step_upon_eretu(regs);
+
frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long));
if (!access_ok(frame, sizeof(*frame)))
goto badframe;
@@ -366,6 +368,8 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
sigset_t set;
unsigned long uc_flags;
 
+   prevent_single_step_upon_eretu(regs);
+
fr

[PATCH v5 2/2] selftests/x86: Add a test to detect infinite sigtrap handler loop

2025-06-06 Thread Xin Li (Intel)
When FRED is enabled, if the Trap Flag (TF) is set without an external
debugger attached, it can lead to an infinite loop in the SIGTRAP
handler.  To avoid this, the software event flag in the augmented SS
must be cleared, ensuring that no single-step trap remains pending when
ERETU completes.

This test checks for that specific scenario—verifying whether the kernel
correctly prevents an infinite SIGTRAP loop in this edge case when FRED
is enabled.

The test should _always_ pass with IDT event delivery, thus no need to
disable the test even when FRED is not enabled.

Signed-off-by: Xin Li (Intel) 
Tested-by: Sohil Mehta 
Cc: sta...@vger.kernel.org
---

Changes in v5:
*) Do "sub $-128, %rsp" rather than "add $128, %rsp", which is more
   efficient in code size (hpa).
*) Add TB from Sohil.
*) Add Cc: sta...@vger.kernel.org.

Changes in v4:
*) merge this selftest with its bug fix patch set (Dave Hansen).
*) Address review comments from Sohil.
---
 tools/testing/selftests/x86/Makefile   |  2 +-
 tools/testing/selftests/x86/sigtrap_loop.c | 98 ++
 2 files changed, 99 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c

diff --git a/tools/testing/selftests/x86/Makefile 
b/tools/testing/selftests/x86/Makefile
index f703fcfe9f7c..83148875a12c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" 
trivial_program.c -no-pie)
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt 
test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
-   test_vsyscall mov_ss_trap \
+   test_vsyscall mov_ss_trap sigtrap_loop \
syscall_arg_fault fsgsbase_restore sigaltstack
 TARGETS_C_BOTHBITS += nx_stack
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
diff --git a/tools/testing/selftests/x86/sigtrap_loop.c 
b/tools/testing/selftests/x86/sigtrap_loop.c
new file mode 100644
index ..9eecf32c79c2
--- /dev/null
+++ b/tools/testing/selftests/x86/sigtrap_loop.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Intel Corporation
+ */
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __x86_64__
+# define REG_IP REG_RIP
+#else
+# define REG_IP REG_EIP
+#endif
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int 
flags)
+{
+   struct sigaction sa;
+
+   memset(&sa, 0, sizeof(sa));
+   sa.sa_sigaction = handler;
+   sa.sa_flags = SA_SIGINFO | flags;
+   sigemptyset(&sa.sa_mask);
+
+   if (sigaction(sig, &sa, 0))
+   err(1, "sigaction");
+
+   return;
+}
+
+static unsigned int loop_count_on_same_ip;
+
+static void sigtrap(int sig, siginfo_t *info, void *ctx_void)
+{
+   ucontext_t *ctx = (ucontext_t *)ctx_void;
+   static unsigned long last_trap_ip;
+
+   if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) {
+   printf("\tTrapped at %016lx\n", last_trap_ip);
+
+   /*
+* If the same IP is hit more than 10 times in a row, it is
+* _considered_ an infinite loop.
+*/
+   if (++loop_count_on_same_ip > 10) {
+   printf("[FAIL]\tDetected sigtrap infinite loop\n");
+   exit(1);
+   }
+
+   return;
+   }
+
+   loop_count_on_same_ip = 0;
+   last_trap_ip = ctx->uc_mcontext.gregs[REG_IP];
+   printf("\tTrapped at %016lx\n", last_trap_ip);
+}
+
+int main(int argc, char *argv[])
+{
+   sethandler(SIGTRAP, sigtrap, 0);
+
+   /*
+* Set the Trap Flag (TF) to single-step the test code, therefore to
+* trigger a SIGTRAP signal after each instruction until the TF is
+* cleared.
+*
+* Because the arithmetic flags are not significant here, the TF is
+* set by pushing 0x302 onto the stack and then popping it into the
+* flags register.
+*
+* Four instructions in the following asm code are executed with the
+* TF set, thus the SIGTRAP handler is expected to run four times.
+*/
+   printf("[RUN]\tsigtrap infinite loop detection\n");
+   asm volatile(
+#ifdef __x86_64__
+   /* Avoid clobbering the redzone */
+   "sub $128, %rsp\n\t"
+#endif
+   "push $0x302\n\t"
+   "popf\n\t"
+   "nop\n\t"
+   "nop\n\t"
+   "push $0x202\n\t"
+   "popf\n\t"
+#ifdef __x86_64__
+   /* Equivalent to "add $128, %rsp" with 3 fewer bytes in 
encoding */
+   "sub $-128, %rsp\n\t"
+#endif
+   );
+
+   printf("[OK]\tNo sigtrap infinite loop detected\n");
+   return 0;
+}
-- 
2.49.0




[PATCH v6 1/2] x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler

2025-06-09 Thread Xin Li (Intel)
Clear the software event flag in the augmented SS to prevent immediate
repeat of single step trap on return from SIGTRAP handler if the trap
flag (TF) is set without an external debugger attached.

Following is a typical single-stepping flow for a user process:

1) The user process is prepared for single-stepping by setting
   RFLAGS.TF = 1.
2) When any instruction in user space completes, a #DB is triggered.
3) The kernel handles the #DB and returns to user space, invoking the
   SIGTRAP handler with RFLAGS.TF = 0.
4) After the SIGTRAP handler finishes, the user process performs a
   sigreturn syscall, restoring the original state, including
   RFLAGS.TF = 1.
5) Goto step 2.
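
As a minimal illustration of step 1), a process can arm single-stepping on
itself with a pushf/popf sequence; the following is a hedged sketch that
mirrors what the selftest in this series does, and the helper name is made up:

/* Hedged sketch: arm single-step mode on the current task (step 1).
 * The first #DB then fires after the next instruction executed with
 * RFLAGS.TF = 1.
 */
static inline void arm_single_step(void)
{
        asm volatile(
#ifdef __x86_64__
                "sub    $128, %%rsp\n\t"        /* stay clear of the red zone */
#endif
                "pushf\n\t"
                "orl    $0x100, (%%rsp)\n\t"    /* set X86_EFLAGS_TF */
                "popf\n\t"
#ifdef __x86_64__
                "sub    $-128, %%rsp\n\t"       /* shorter encoding than add $128 */
#endif
                : : : "cc", "memory");
}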

According to the FRED specification:

A) Bit 17 in the augmented SS is designated as the software event
   flag, which is set to 1 for FRED event delivery of SYSCALL,
   SYSENTER, or INT n.
B) If bit 17 of the augmented SS is 1 and ERETU would result in
   RFLAGS.TF = 1, a single-step trap will be pending upon completion
   of ERETU.

In step 4) above, the software event flag is set upon the sigreturn
syscall, and its corresponding ERETU would restore RFLAGS.TF = 1.
This combination causes a pending single-step trap upon completion of
ERETU.  Therefore, another #DB is triggered before any user space
instruction is executed, which leads to an infinite loop in which the
SIGTRAP handler keeps being invoked on the same user space IP.

Suggested-by: H. Peter Anvin (Intel) 
Tested-by: Sohil Mehta 
Signed-off-by: Xin Li (Intel) 
Cc: sta...@vger.kernel.org
---

Change in v6:
*) Add TB from Sohil.

Change in v5:
*) Accurately rephrase the shortlog (hpa).

Change in v4:
*) Add a selftest to the patch set (Dave Hansen).

Change in v3:
*) Use "#ifdef CONFIG_X86_FRED" instead of IS_ENABLED(CONFIG_X86_FRED)
   (Intel LKP).

Change in v2:
*) Remove the check cpu_feature_enabled(X86_FEATURE_FRED), because
   regs->fred_ss.swevent will always be 0 otherwise (hpa).
---
 arch/x86/include/asm/sighandling.h | 22 ++
 arch/x86/kernel/signal_32.c|  4 
 arch/x86/kernel/signal_64.c|  4 
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/sighandling.h 
b/arch/x86/include/asm/sighandling.h
index e770c4fc47f4..8727c7e21dd1 100644
--- a/arch/x86/include/asm/sighandling.h
+++ b/arch/x86/include/asm/sighandling.h
@@ -24,4 +24,26 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs 
*regs);
 int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
 int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
 
+/*
+ * To prevent immediate repeat of single step trap on return from SIGTRAP
+ * handler if the trap flag (TF) is set without an external debugger attached,
+ * clear the software event flag in the augmented SS, ensuring no single-step
+ * trap is pending upon ERETU completion.
+ *
+ * Note, this function should be called in sigreturn() before the original
+ * state is restored to make sure the TF is read from the entry frame.
+ */
+static __always_inline void prevent_single_step_upon_eretu(struct pt_regs 
*regs)
+{
+   /*
+* If the trap flag (TF) is set, i.e., the sigreturn() SYSCALL 
instruction
+* is being single-stepped, do not clear the software event flag in the
+* augmented SS, thus a debugger won't skip over the following 
instruction.
+*/
+#ifdef CONFIG_X86_FRED
+   if (!(regs->flags & X86_EFLAGS_TF))
+   regs->fred_ss.swevent = 0;
+#endif
+}
+
 #endif /* _ASM_X86_SIGHANDLING_H */
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 98123ff10506..42bbc42bd350 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -152,6 +152,8 @@ SYSCALL32_DEFINE0(sigreturn)
struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user 
*)(regs->sp-8);
sigset_t set;
 
+   prevent_single_step_upon_eretu(regs);
+
if (!access_ok(frame, sizeof(*frame)))
goto badframe;
if (__get_user(set.sig[0], &frame->sc.oldmask)
@@ -175,6 +177,8 @@ SYSCALL32_DEFINE0(rt_sigreturn)
struct rt_sigframe_ia32 __user *frame;
sigset_t set;
 
+   prevent_single_step_upon_eretu(regs);
+
frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4);
 
if (!access_ok(frame, sizeof(*frame)))
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index ee9453891901..d483b585c6c6 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -250,6 +250,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
sigset_t set;
unsigned long uc_flags;
 
+   prevent_single_step_upon_eretu(regs);
+
frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long));
if (!access_ok(frame, sizeof(*frame)))
goto badframe;
@@ -366,6 +368,8 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
sigset_t set;
unsigned long uc_flags;
 

[PATCH v6 2/2] selftests/x86: Add a test to detect infinite SIGTRAP handler loop

2025-06-09 Thread Xin Li (Intel)
When FRED is enabled, if the Trap Flag (TF) is set without an external
debugger attached, it can lead to an infinite loop in the SIGTRAP
handler.  To avoid this, the software event flag in the augmented SS
must be cleared, ensuring that no single-step trap remains pending when
ERETU completes.

This test checks for that specific scenario—verifying whether the kernel
correctly prevents an infinite SIGTRAP loop in this edge case when FRED
is enabled.

The test should _always_ pass with IDT event delivery, so there is no need
to disable the test even when FRED is not enabled.

Tested-by: Sohil Mehta 
Signed-off-by: Xin Li (Intel) 
Cc: sta...@vger.kernel.org
---

Changes in v6:
*) Replace a "sub $128, %rsp" with "add $-128, %rsp" (hpa).
*) Declared loop_count_on_same_ip inside sigtrap() (Sohil).
*) s/sigtrap/SIGTRAP (Sohil).

Changes in v5:
*) Do "sub $-128, %rsp" rather than "add $128, %rsp", which is more
   efficient in code size (hpa).
*) Add TB from Sohil.
*) Add Cc: sta...@vger.kernel.org.

Changes in v4:
*) merge this selftest with its bug fix patch set (Dave Hansen).
*) Address review comments from Sohil.
---
 tools/testing/selftests/x86/Makefile   |   2 +-
 tools/testing/selftests/x86/sigtrap_loop.c | 101 +
 2 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c

diff --git a/tools/testing/selftests/x86/Makefile 
b/tools/testing/selftests/x86/Makefile
index f703fcfe9f7c..83148875a12c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" 
trivial_program.c -no-pie)
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt 
test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
-   test_vsyscall mov_ss_trap \
+   test_vsyscall mov_ss_trap sigtrap_loop \
syscall_arg_fault fsgsbase_restore sigaltstack
 TARGETS_C_BOTHBITS += nx_stack
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
diff --git a/tools/testing/selftests/x86/sigtrap_loop.c 
b/tools/testing/selftests/x86/sigtrap_loop.c
new file mode 100644
index ..9d065479e89f
--- /dev/null
+++ b/tools/testing/selftests/x86/sigtrap_loop.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Intel Corporation
+ */
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __x86_64__
+# define REG_IP REG_RIP
+#else
+# define REG_IP REG_EIP
+#endif
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int 
flags)
+{
+   struct sigaction sa;
+
+   memset(&sa, 0, sizeof(sa));
+   sa.sa_sigaction = handler;
+   sa.sa_flags = SA_SIGINFO | flags;
+   sigemptyset(&sa.sa_mask);
+
+   if (sigaction(sig, &sa, 0))
+   err(1, "sigaction");
+
+   return;
+}
+
+static void sigtrap(int sig, siginfo_t *info, void *ctx_void)
+{
+   ucontext_t *ctx = (ucontext_t *)ctx_void;
+   static unsigned int loop_count_on_same_ip;
+   static unsigned long last_trap_ip;
+
+   if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) {
+   printf("\tTrapped at %016lx\n", last_trap_ip);
+
+   /*
+* If the same IP is hit more than 10 times in a row, it is
+* _considered_ an infinite loop.
+*/
+   if (++loop_count_on_same_ip > 10) {
+   printf("[FAIL]\tDetected SIGTRAP infinite loop\n");
+   exit(1);
+   }
+
+   return;
+   }
+
+   loop_count_on_same_ip = 0;
+   last_trap_ip = ctx->uc_mcontext.gregs[REG_IP];
+   printf("\tTrapped at %016lx\n", last_trap_ip);
+}
+
+int main(int argc, char *argv[])
+{
+   sethandler(SIGTRAP, sigtrap, 0);
+
+   /*
+* Set the Trap Flag (TF) to single-step the test code, thereby
+* triggering a SIGTRAP signal after each instruction until the TF is
+* cleared.
+*
+* Because the arithmetic flags are not significant here, the TF is
+* set by pushing 0x302 onto the stack and then popping it into the
+* flags register.
+*
+* Four instructions in the following asm code are executed with the
+* TF set, thus the SIGTRAP handler is expected to run four times.
+*/
+   printf("[RUN]\tSIGTRAP infinite loop detection\n");
+   asm volatile(
+#ifdef __x86_64__
+   /*
+* Avoid clobbering the redzone
+*
+* Equivalent to "sub $128, %rsp", however -128 can be encoded
+* in a single byte immediate while 128 uses 4 bytes.
+*/

[PATCH v6 0/2] x86/fred: Prevent immediate repeat of single step trap on return from SIGTRAP handler

2025-06-09 Thread Xin Li (Intel)
IDT event delivery has a debug hole in which it does not generate #DB
upon returning to userspace before the first userspace instruction is
executed if the Trap Flag (TF) is set.

FRED closes this hole by introducing a software event flag, i.e., bit
17 of the augmented SS: if the bit is set and ERETU would result in
RFLAGS.TF = 1, a single-step trap will be pending upon completion of
ERETU.

However I overlooked properly setting and clearing the bit in different
situations.  Thus when FRED is enabled, if the Trap Flag (TF) is set
without an external debugger attached, it can lead to an infinite loop
in the SIGTRAP handler.  To avoid this, the software event flag in the
augmented SS must be cleared, ensuring that no single-step trap remains
pending when ERETU completes.

This patch set combines the fix [1] and its corresponding selftest [2]
into a single series, as requested by Dave Hansen.

[1] https://lore.kernel.org/lkml/20250523050153.3308237-1-...@zytor.com/
[2] https://lore.kernel.org/lkml/20250530230707.2528916-1-...@zytor.com/


This patch set is based on tip/x86/urgent branch.


Link to v5 of this patch set:
https://lore.kernel.org/lkml/20250606174528.1004756-1-...@zytor.com/


Changes in v6:
*) Replace a "sub $128, %rsp" with "add $-128, %rsp" (hpa).
*) Declared loop_count_on_same_ip inside sigtrap() (Sohil).
*) s/sigtrap/SIGTRAP (Sohil).
*) Add TB from Sohil to the first patch.


Xin Li (Intel) (2):
  x86/fred/signal: Prevent immediate repeat of single step trap on
return from SIGTRAP handler
  selftests/x86: Add a test to detect infinite SIGTRAP handler loop

 arch/x86/include/asm/sighandling.h |  22 +
 arch/x86/kernel/signal_32.c|   4 +
 arch/x86/kernel/signal_64.c|   4 +
 tools/testing/selftests/x86/Makefile   |   2 +-
 tools/testing/selftests/x86/sigtrap_loop.c | 101 +
 5 files changed, 132 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c


base-commit: dd2922dcfaa3296846265e113309e5f7f138839f
-- 
2.49.0




[PATCH v1 1/1] selftests/x86: Add a test to detect infinite sigtrap handler loop

2025-05-30 Thread Xin Li (Intel)
When FRED is enabled, if the Trap Flag (TF) is set without an external
debugger attached, it can lead to an infinite loop in the SIGTRAP
handler.  To avoid this, the software event flag in the augmented SS
must be cleared, ensuring that no single-step trap remains pending when
ERETU completes.

This test checks for that specific scenario—verifying whether the kernel
correctly prevents an infinite SIGTRAP loop in this edge case.

Signed-off-by: Xin Li (Intel) 
---
 tools/testing/selftests/x86/Makefile  |  2 +-
 .../selftests/x86/test_sigtrap_handler.c  | 80 +++
 2 files changed, 81 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/test_sigtrap_handler.c

diff --git a/tools/testing/selftests/x86/Makefile 
b/tools/testing/selftests/x86/Makefile
index f703fcfe9f7c..c486fd88ebb1 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -12,7 +12,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" 
trivial_program.c -no-pie)
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt 
test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
-   test_vsyscall mov_ss_trap \
+   test_vsyscall mov_ss_trap test_sigtrap_handler \
syscall_arg_fault fsgsbase_restore sigaltstack
 TARGETS_C_BOTHBITS += nx_stack
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
diff --git a/tools/testing/selftests/x86/test_sigtrap_handler.c 
b/tools/testing/selftests/x86/test_sigtrap_handler.c
new file mode 100644
index ..9c5c2cf0cf88
--- /dev/null
+++ b/tools/testing/selftests/x86/test_sigtrap_handler.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ *  Copyright (C) 2025 Intel Corporation
+ */
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __x86_64__
+# define REG_IP REG_RIP
+#else
+# define REG_IP REG_EIP
+#endif
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), int 
flags)
+{
+   struct sigaction sa;
+
+   memset(&sa, 0, sizeof(sa));
+   sa.sa_sigaction = handler;
+   sa.sa_flags = SA_SIGINFO | flags;
+   sigemptyset(&sa.sa_mask);
+
+   if (sigaction(sig, &sa, 0))
+   err(1, "sigaction");
+
+   return;
+}
+
+static unsigned int loop_count_on_same_ip;
+
+static void sigtrap(int sig, siginfo_t *info, void *ctx_void)
+{
+   ucontext_t *ctx = (ucontext_t *)ctx_void;
+   static unsigned long last_trap_ip;
+
+   if (last_trap_ip == ctx->uc_mcontext.gregs[REG_IP]) {
+   printf("trapped on %016lx\n", last_trap_ip);
+
+   if (++loop_count_on_same_ip > 10) {
+   printf("trap loop detected, test failed\n");
+   exit(2);
+   }
+
+   return;
+   }
+
+   loop_count_on_same_ip = 0;
+   last_trap_ip = ctx->uc_mcontext.gregs[REG_IP];
+   printf("trapped on %016lx\n", last_trap_ip);
+}
+
+int main(int argc, char *argv[])
+{
+   sethandler(SIGTRAP, sigtrap, 0);
+
+   asm volatile(
+#ifdef __x86_64__
+   /* Avoid clobbering the redzone */
+   "sub $128, %rsp\n\t"
+#endif
+   "push $0x302\n\t"
+   "popf\n\t"
+   "nop\n\t"
+   "nop\n\t"
+   "push $0x202\n\t"
+   "popf\n\t"
+#ifdef __x86_64__
+   "add $128, %rsp\n\t"
+#endif
+   );
+
+   printf("test passed\n");
+   return 0;
+}

base-commit: 485d11d84a2452ac16466cc7ae041c93d38929bc
-- 
2.49.0




[PATCH v3 6/7] x86/tls,ptrace: provide regset access to the GDT

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

Provide access to the user-visible part of the GDT via a regset in
ptrace().  Note that we already provide a regset for the TLS area part
of the GDT; these can trivially be unified by looking at the contents
of the regset structure, especially since the TLS area is the only
user-modifiable part of the GDT.
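
For context, here is a hedged user-space sketch of how a debugger might read
the new regset; the NT_X86_GDT value below is a placeholder assumption (the
real value comes from the elf.h change in this patch), and the helper is made
up:

/* Hedged sketch: dump the tracee's user-visible GDT slots via the new
 * regset.  The tracee must already be ptrace-stopped.
 */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <asm/ldt.h>                    /* struct user_desc */

#ifndef NT_X86_GDT
# define NT_X86_GDT 0x203               /* assumed note type from this series */
#endif

static int dump_gdt(pid_t pid)
{
        struct user_desc gdt[32];       /* 32 covers GDT_ENTRIES on both bitnesses */
        struct iovec iov = { .iov_base = gdt, .iov_len = sizeof(gdt) };
        unsigned int i;

        if (ptrace(PTRACE_GETREGSET, pid,
                   (void *)(unsigned long)NT_X86_GDT, &iov) < 0)
                return -1;

        for (i = 0; i < iov.iov_len / sizeof(gdt[0]); i++)
                printf("slot %u: base=%#x limit=%#x\n",
                       i, gdt[i].base_addr, gdt[i].limit);
        return 0;
}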

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
---
 arch/x86/kernel/ptrace.c | 83 ++--
 arch/x86/kernel/tls.c| 83 ++--
 arch/x86/kernel/tls.h|  8 +++--
 include/uapi/linux/elf.h |  1 +
 4 files changed, 118 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index e2ee403865eb..5ce10310f440 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -50,6 +50,7 @@ enum x86_regset {
REGSET_XSTATE,
REGSET_TLS,
REGSET_IOPERM32,
+   REGSET_GDT
 };
 
 struct pt_regs_offset {
@@ -747,6 +748,60 @@ static int ioperm_get(struct task_struct *target,
   0, IO_BITMAP_BYTES);
 }
 
+/*
+ * These provide read access to the GDT.  As the only part that is
+ * writable is the TLS area, that code is in tls.c.
+ */
+static int gdt_get(struct task_struct *target,
+  const struct user_regset *regset,
+  unsigned int pos, unsigned int count,
+  void *kbuf, void __user *ubuf)
+{
+   struct desc_struct gdt_copy[GDT_LAST_USER + 1];
+   const struct desc_struct *p;
+   struct user_desc udesc;
+   unsigned int index, endindex;
+   int err;
+
+   if (pos % sizeof(struct user_desc))
+   return -EINVAL;
+
+   /* Get a snapshot of the GDT from an arbitrary CPU */
+   memcpy(gdt_copy, get_current_gdt_ro(), sizeof(gdt_copy));
+
+   /* Copy over the TLS area */
+   memcpy(&gdt_copy[GDT_ENTRY_TLS_MIN], target->thread.tls_array,
+  sizeof(target->thread.tls_array));
+
+   /* Descriptor zero is never accessible */
+   memset(&gdt_copy[0], 0, sizeof(gdt_copy[0]));
+
+   index = pos/sizeof(struct user_desc);
+   endindex = index + count/sizeof(struct user_desc);
+   endindex = min_t(unsigned int, GDT_LAST_USER + 1 - regset->bias,
+   endindex);
+
+   p = &gdt_copy[index + regset->bias];
+
+   while (count && index < endindex) {
+   fill_user_desc(&udesc, index++, p++);
+   err = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &udesc,
+pos, pos + sizeof(udesc));
+   if (err)
+   return err;
+   }
+
+   /* Return zero for the rest of the regset, if applicable. */
+   return user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf, 0, -1);
+}
+
+static int gdt_active(struct task_struct  *target,
+ const struct user_regset *regset)
+{
+   (void)target;
+   return GDT_LAST_USER + 1;
+}
+
 /*
  * Called by kernel/ptrace.c when detaching..
  *
@@ -1262,7 +1317,8 @@ static struct user_regset x86_64_regsets[] 
__ro_after_init = {
.core_note_type = NT_PRFPREG,
.n = sizeof(struct user_i387_struct) / sizeof(long),
.size = sizeof(long), .align = sizeof(long),
-   .active = regset_xregset_fpregs_active, .get = xfpregs_get, 
.set = xfpregs_set
+   .active = regset_xregset_fpregs_active, .get = xfpregs_get,
+   .set = xfpregs_set
},
[REGSET_XSTATE] = {
.core_note_type = NT_X86_XSTATE,
@@ -1276,6 +1332,14 @@ static struct user_regset x86_64_regsets[] 
__ro_after_init = {
.size = sizeof(long), .align = sizeof(long),
.active = ioperm_active, .get = ioperm_get
},
+   [REGSET_GDT] = {
+   .core_note_type = NT_X86_GDT,
+   .n = LDT_ENTRIES, /* Theoretical maximum */
+   .size = sizeof(struct user_desc),
+   .align = sizeof(struct user_desc),
+   .active = gdt_active, .get = gdt_get,
+   .set = regset_gdt_set
+   },
 };
 
 static const struct user_regset_view user_x86_64_view = {
@@ -1309,7 +1373,8 @@ static struct user_regset x86_32_regsets[] 
__ro_after_init = {
.core_note_type = NT_PRXFPREG,
.n = sizeof(struct user32_fxsr_struct) / sizeof(u32),
.size = sizeof(u32), .align = sizeof(u32),
-   .active = regset_xregset_fpregs_active, .get = xfpregs_get, 
.set = xfpregs_set
+   .active = regset_xregset_fpregs_active, .get = xfpregs_get,
+   .set = xfpregs_set
},
[REGSET_XSTATE] = {
.core_note_type = NT_X86_XSTA

[PATCH v3 4/7] x86/tls: create an explicit config symbol for the TLS area in the GDT

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

Instead of using X86_32 || IA32_EMULATION, which is really rather ugly
in the Makefile especially, create a dedicated config symbol for the
TLS area.  This will be further used in subsequent patches.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
---
 arch/x86/Kconfig | 4 
 arch/x86/kernel/Makefile | 3 +--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f1dbb4ee19d7..e48036e3f158 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2937,6 +2937,10 @@ config X86_DMA_REMAP
 config HAVE_GENERIC_GUP
def_bool y
 
+config X86_TLS_AREA
+   def_bool y
+   depends on IA32_EMULATION || X86_32
+
 source "net/Kconfig"
 
 source "drivers/Kconfig"
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 02d6f5cf4e70..d26ef8a40a88 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -65,8 +65,7 @@ obj-y += resource.o
 obj-y  += process.o
 obj-y  += fpu/
 obj-y  += ptrace.o
-obj-$(CONFIG_X86_32)   += tls.o
-obj-$(CONFIG_IA32_EMULATION)   += tls.o
+obj-$(CONFIG_X86_TLS_AREA) += tls.o
 obj-y  += step.o
 obj-$(CONFIG_INTEL_TXT)+= tboot.o
 obj-$(CONFIG_ISA_DMA_API)  += i8237.o
-- 
2.14.4



[PATCH v3 5/7] x86/segment: add #define for the last user-visible GDT slot

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

We don't want to advertise to user space how many slots the kernel
GDT has, but user space can trivially find out what the last
user-accessible GDT slot is.  Add a #define for that so we can use
that in sizing a regset.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
---
 arch/x86/include/asm/segment.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index e293c122d0d5..5eb809eec048 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -115,6 +115,11 @@
  */
 #define GDT_ENTRIES32
 
+/*
+ * Last user-visible GDT slot
+ */
+#define GDT_LAST_USER  GDT_ENTRY_DEFAULT_USER_DS
+
 /*
  * Segment selector values corresponding to the above entries:
  */
@@ -194,6 +199,11 @@
  */
 #define GDT_ENTRIES16
 
+/*
+ * Last user-visible GDT slot
+ */
+#define GDT_LAST_USER  GDT_ENTRY_PER_CPU
+
 /*
  * Segment selector values corresponding to the above entries:
  *
-- 
2.14.4



[PATCH v3 3/7] x86: move fill_user_desc() from tls.c to desc.h and add validity check

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

This is generic code which is potentially useful in other contexts.
Unfortunately modify_ldt() is kind of stupid in that it returns a
descriptor in CPU format but takes a different format as input;
regsets, however, *have* to operate differently.
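
A hedged usage sketch of the helper in its new home (the caller below is
illustrative kernel-context code, not part of this series; it assumes
<linux/printk.h> and <asm/desc.h> are available):

/* Hedged sketch: convert one hardware descriptor into the user-visible
 * struct user_desc layout, as the GDT regset code later in this series
 * does for every slot it exports.
 */
static void example_report_slot(const struct desc_struct *desc, int idx)
{
        struct user_desc info;

        fill_user_desc(&info, idx, desc);
        pr_info("slot %d: base=%#x limit=%#x 32bit=%u\n",
                idx, info.base_addr, info.limit, info.seg_32bit);
}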

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
---
 arch/x86/include/asm/desc.h | 22 ++
 arch/x86/kernel/tls.c   | 19 ---
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 13c5ee878a47..5e69f993c9ff 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -386,6 +386,28 @@ static inline void set_desc_limit(struct desc_struct 
*desc, unsigned long limit)
desc->limit1 = (limit >> 16) & 0xf;
 }
 
+
+static inline void fill_user_desc(struct user_desc *info, int idx,
+  const struct desc_struct *desc)
+
+{
+   memset(info, 0, sizeof(*info));
+   info->entry_number = idx;
+   if (desc->s && desc->dpl == 3) {
+   info->base_addr = get_desc_base(desc);
+   info->limit = get_desc_limit(desc);
+   info->seg_32bit = desc->d;
+   info->contents = desc->type >> 2;
+   info->read_exec_only = !(desc->type & 2);
+   info->limit_in_pages = desc->g;
+   info->seg_not_present = !desc->p;
+   info->useable = desc->avl;
+#ifdef CONFIG_X86_64
+   info->lm = desc->l;
+#endif
+   }
+}
+
 void update_intr_gate(unsigned int n, const void *addr);
 void alloc_intr_gate(unsigned int n, const void *addr);
 
diff --git a/arch/x86/kernel/tls.c b/arch/x86/kernel/tls.c
index a5b802a12212..7b8ecb760707 100644
--- a/arch/x86/kernel/tls.c
+++ b/arch/x86/kernel/tls.c
@@ -197,25 +197,6 @@ SYSCALL_DEFINE1(set_thread_area, struct user_desc __user 
*, u_info)
  * Get the current Thread-Local Storage area:
  */
 
-static void fill_user_desc(struct user_desc *info, int idx,
-  const struct desc_struct *desc)
-
-{
-   memset(info, 0, sizeof(*info));
-   info->entry_number = idx;
-   info->base_addr = get_desc_base(desc);
-   info->limit = get_desc_limit(desc);
-   info->seg_32bit = desc->d;
-   info->contents = desc->type >> 2;
-   info->read_exec_only = !(desc->type & 2);
-   info->limit_in_pages = desc->g;
-   info->seg_not_present = !desc->p;
-   info->useable = desc->avl;
-#ifdef CONFIG_X86_64
-   info->lm = desc->l;
-#endif
-}
-
 int do_get_thread_area(struct task_struct *p, int idx,
   struct user_desc __user *u_info)
 {
-- 
2.14.4



[PATCH v3 0/7] x86/ptrace: regset access to the GDT and LDT

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

Give a debugger access to the visible part of the GDT and LDT.  This
allows a debugger to find out what a particular segment descriptor
corresponds to; e.g. if %cs is 16, 32, or 64 bits.
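
As a hedged sketch of that classification, using the struct user_desc layout
these regsets return (the helper name is made up):

/* Hedged sketch: decide whether a code-segment descriptor read through the
 * GDT/LDT regsets describes 16-, 32- or 64-bit code.
 */
static const char *code_segment_bitness(const struct user_desc *d)
{
#ifdef __x86_64__
        if (d->lm)                      /* long-mode (64-bit) code segment */
                return "64-bit";
#endif
        return d->seg_32bit ? "32-bit" : "16-bit";
}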

v3:
Requalify LDT segments for selectors that have actually changed.

v2:
Add missing change to elf.h for the very last patch.

Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
Cc: H. Peter Anvin 

 arch/x86/Kconfig   |   4 +
 arch/x86/include/asm/desc.h|  24 +++-
 arch/x86/include/asm/ldt.h |  16 +++
 arch/x86/include/asm/segment.h |  10 ++
 arch/x86/kernel/Makefile   |   3 +-
 arch/x86/kernel/ldt.c  | 283 -
 arch/x86/kernel/ptrace.c   | 103 ++-
 arch/x86/kernel/tls.c  | 102 +--
 arch/x86/kernel/tls.h  |   8 +-
 include/uapi/linux/elf.h   |   2 +
 10 files changed, 413 insertions(+), 142 deletions(-)


[PATCH v3 7/7] x86/ldt,ptrace: provide regset access to the LDT

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

Provide ptrace/regset access to the LDT, if one exists.  This
interface provides both read and write access. The write code is
unified with modify_ldt(); the read code doesn't have enough
similarity, so it has been kept separate.
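
For context, a hedged user-space sketch of the write side; the NT_X86_LDT
value is a placeholder assumption (the real one comes from the elf.h change
in this patch) and the helper is made up:

/* Hedged sketch: install one 32-bit flat data segment as LDT entry 0 of a
 * ptrace-stopped tracee through the new regset.  Entry 0 is written because
 * PTRACE_SETREGSET always starts at regset position 0.
 */
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <asm/ldt.h>                    /* struct user_desc */

#ifndef NT_X86_LDT
# define NT_X86_LDT 0x204               /* assumed note type from this patch */
#endif

static int write_ldt_slot0(pid_t pid, unsigned int base)
{
        struct user_desc d = {
                .entry_number   = 0,
                .base_addr      = base,
                .limit          = 0xfffff,
                .seg_32bit      = 1,
                .limit_in_pages = 1,
                .useable        = 1,
        };
        struct iovec iov = { .iov_base = &d, .iov_len = sizeof(d) };

        return ptrace(PTRACE_SETREGSET, pid,
                      (void *)(unsigned long)NT_X86_LDT, &iov) < 0 ? -1 : 0;
}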

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
---
 arch/x86/include/asm/desc.h |   2 +-
 arch/x86/include/asm/ldt.h  |  16 
 arch/x86/kernel/ldt.c   | 209 +++-
 arch/x86/kernel/ptrace.c|  20 +
 include/uapi/linux/elf.h|   1 +
 5 files changed, 204 insertions(+), 44 deletions(-)
 create mode 100644 arch/x86/include/asm/ldt.h

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 5e69f993c9ff..12a759d76314 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -3,7 +3,7 @@
 #define _ASM_X86_DESC_H
 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
new file mode 100644
index ..302ec2d6d45d
--- /dev/null
+++ b/arch/x86/include/asm/ldt.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Ptrace interface to the LDT
+ *
+ */
+
+#ifndef _ARCH_X86_INCLUDE_ASM_LDT_H
+
+#include 
+#include 
+
+extern user_regset_active_fn regset_ldt_active;
+extern user_regset_get_fn regset_ldt_get;
+extern user_regset_set_fn regset_ldt_set;
+
+#endif /* _ARCH_X86_INCLUDE_ASM_LDT_H */
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 601d24268a99..e80dfde1f82f 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -392,53 +392,39 @@ static int read_default_ldt(void __user *ptr, unsigned 
long bytecount)
return bytecount;
 }
 
-static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
+
+static int do_write_ldt(const struct task_struct *target,
+   const void *kbuf, const void __user *ubuf,
+   unsigned long bytecount, unsigned int index,
+   bool oldmode)
 {
-   struct mm_struct *mm = current->mm;
+   struct mm_struct *mm = target->mm;
struct ldt_struct *new_ldt, *old_ldt;
-   unsigned int old_nr_entries, new_nr_entries;
+   unsigned int count, old_nr_entries, new_nr_entries;
struct user_desc ldt_info;
+   const struct user_desc *kptr = kbuf;
+   const struct user_desc __user *uptr = ubuf;
struct desc_struct ldt;
+   unsigned short first_index, last_index;
int error;
 
-   error = -EINVAL;
-   if (bytecount != sizeof(ldt_info))
-   goto out;
-   error = -EFAULT;
-   if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info)))
-   goto out;
+   if (bytecount % sizeof(ldt_info) ||
+   bytecount >= LDT_ENTRY_SIZE*LDT_ENTRIES)
+   return -EINVAL;
 
-   error = -EINVAL;
-   if (ldt_info.entry_number >= LDT_ENTRIES)
-   goto out;
-   if (ldt_info.contents == 3) {
-   if (oldmode)
-   goto out;
-   if (ldt_info.seg_not_present == 0)
-   goto out;
-   }
+   count = bytecount/sizeof(ldt_info);
+   if (index >= LDT_ENTRIES || index + count > LDT_ENTRIES)
+   return -EINVAL;
 
-   if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
-   LDT_empty(&ldt_info)) {
-   /* The user wants to clear the entry. */
-   memset(&ldt, 0, sizeof(ldt));
-   } else {
-   if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
-   error = -EINVAL;
-   goto out;
-   }
-
-   fill_ldt(&ldt, &ldt_info);
-   if (oldmode)
-   ldt.avl = 0;
-   }
+   first_index = index;
+   last_index = index + count - 1;
 
if (down_write_killable(&mm->context.ldt_usr_sem))
return -EINTR;
 
-   old_ldt   = mm->context.ldt;
+   old_ldt= mm->context.ldt;
old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
-   new_nr_entries = max(ldt_info.entry_number + 1, old_nr_entries);
+   new_nr_entries = max(index + count, old_nr_entries);
 
error = -ENOMEM;
new_ldt = alloc_ldt_struct(new_nr_entries);
@@ -446,9 +432,46 @@ static int write_ldt(void __user *ptr, unsigned long 
bytecount, int oldmode)
goto out_unlock;
 
if (old_ldt)
-   memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * 
LDT_ENTRY_SIZE);
+   memcpy(new_ldt->entries, old_ldt->entries,
+  old_nr_entries * LDT_ENTRY_SIZE);
+
+   while (count--) {
+   error = -EFAULT;
+ 

[PATCH v3 2/7] x86/ldt: use a common value for read_default_ldt()

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

There is no point in having two different sizes for the "default ldt",
a concept which is obsolete anyway.  Since this is kernel-dependent
and not user-space dependent, a 32-bit app needs to be able to accept
the 64-bit value anyway, so use that value, which is the larger of the
two.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
---
 arch/x86/kernel/ldt.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 18e9f4c0633d..601d24268a99 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -383,12 +383,8 @@ static int read_ldt(void __user *ptr, unsigned long 
bytecount)
 
 static int read_default_ldt(void __user *ptr, unsigned long bytecount)
 {
-   /* CHECKME: Can we use _one_ random number ? */
-#ifdef CONFIG_X86_32
-   unsigned long size = 5 * sizeof(struct desc_struct);
-#else
-   unsigned long size = 128;
-#endif
+   const unsigned long size = 128;
+
if (bytecount > size)
bytecount = size;
if (clear_user(ptr, bytecount))
-- 
2.14.4



[PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()

2018-06-21 Thread H. Peter Anvin, Intel
From: "H. Peter Anvin" 

It is not only %ds and %es which contain cached user descriptor
information; %fs and %gs do as well.

To make sure we don't do something stupid that will affect processes
which wouldn't want this requalification, be more restrictive about
which selector numbers will be requalified: they need to be LDT
selectors (which by definition are never null), have an RPL of 3
(always the case in user space unless null), and match the updated
descriptor.

The infrastructure is set up to allow a range of descriptors; this
will be used in a subsequent patch.
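
For reference, a hedged sketch of the selector encoding the new
need_requalify() range check relies on (the helper is illustrative, not part
of the patch):

/* Hedged sketch: x86 segment selector layout.
 *
 *   bits 15:3  descriptor index
 *   bit     2  table indicator (0 = GDT, 1 = LDT)
 *   bits  1:0  requested privilege level (RPL)
 */
static inline unsigned short example_ldt_selector(unsigned short index)
{
        return (index << 3) | 0x4 /* TI = LDT */ | 0x3 /* RPL 3 */;
}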

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Chang S. Bae 
Cc: Markus T. Metzger 
---
 arch/x86/kernel/ldt.c | 70 +++
 1 file changed, 54 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index c9b14020f4dd..18e9f4c0633d 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -29,36 +29,68 @@
 #include 
 #include 
 
-static void refresh_ldt_segments(void)
-{
+struct flush_ldt_info {
+   struct mm_struct *mm;
+   unsigned short first_desc;
+   unsigned short last_desc;
+};
+
 #ifdef CONFIG_X86_64
+
+static inline bool
+need_requalify(unsigned short sel, const struct flush_ldt_info *info)
+{
+   /* Must be an LDT segment descriptor with an RPL of 3 */
+   if ((sel & (SEGMENT_TI_MASK|SEGMENT_RPL_MASK)) != (SEGMENT_LDT|3))
+   return false;
+
+   return sel >= info->first_desc && sel <= info->last_desc;
+}
+
+static void refresh_ldt_segments(const struct flush_ldt_info *info)
+{
unsigned short sel;
 
/*
-* Make sure that the cached DS and ES descriptors match the updated
-* LDT.
+* Make sure that the cached DS/ES/FS/GS descriptors
+* match the updated LDT, if the specific selectors point
+* to LDT entries that have changed.
 */
savesegment(ds, sel);
-   if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
+   if (need_requalify(sel, info))
loadsegment(ds, sel);
 
savesegment(es, sel);
-   if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
+   if (need_requalify(sel, info))
loadsegment(es, sel);
-#endif
+
+   savesegment(fs, sel);
+   if (need_requalify(sel, info))
+   loadsegment(fs, sel);
+
+   savesegment(gs, sel);
+   if (need_requalify(sel, info))
+   load_gs_index(sel);
 }
 
+#else
+/* On 32 bits, entry_32.S takes care of this on kernel exit */
+static void refresh_ldt_segments(const struct flush_ldt_info *info)
+{
+   (void)info;
+}
+#endif
+
 /* context.lock is held by the task which issued the smp function call */
-static void flush_ldt(void *__mm)
+static void flush_ldt(void *_info)
 {
-   struct mm_struct *mm = __mm;
+   const struct flush_ldt_info *info = _info;
 
-   if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
+   if (this_cpu_read(cpu_tlbstate.loaded_mm) != info->mm)
return;
 
-   load_mm_ldt(mm);
-
-   refresh_ldt_segments();
+   load_mm_ldt(info->mm);
+   refresh_ldt_segments(info);
 }
 
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. 
*/
@@ -223,15 +255,21 @@ static void finalize_ldt_struct(struct ldt_struct *ldt)
paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
 }
 
-static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
+static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt,
+   unsigned short first_index, unsigned short last_index)
 {
+   struct flush_ldt_info info;
+
mutex_lock(&mm->context.lock);
 
/* Synchronizes with READ_ONCE in load_mm_ldt. */
smp_store_release(&mm->context.ldt, ldt);
 
/* Activate the LDT for all CPUs using currents mm. */
-   on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
+   info.mm = mm;
+   info.first_desc = (first_index << 3)|SEGMENT_LDT|3;
+   info.last_desc  = (last_index << 3)|SEGMENT_LDT|3;
+   on_each_cpu_mask(mm_cpumask(mm), flush_ldt, &info, true);
 
mutex_unlock(&mm->context.lock);
 }
@@ -436,7 +474,7 @@ static int write_ldt(void __user *ptr, unsigned long 
bytecount, int oldmode)
goto out_unlock;
}
 
-   install_ldt(mm, new_ldt);
+   install_ldt(mm, new_ldt, ldt_info.entry_number, ldt_info.entry_number);
free_ldt_struct(old_ldt);
error = 0;
 
-- 
2.14.4



[PATCH 4/5] arch/parisc, termios: use

2018-10-04 Thread H. Peter Anvin (Intel)
The PARISC  definition of termbits.h is almost identical to the generic
one, so use the generic one and only redefine a handful of constants.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Kate Stewart 
Cc: Thomas Gleixner 
Cc: Philippe Ombredanne 
Cc: Greg Kroah-Hartman 
Cc: 
Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
---
 arch/parisc/include/uapi/asm/termbits.h | 197 +---
 1 file changed, 4 insertions(+), 193 deletions(-)

diff --git a/arch/parisc/include/uapi/asm/termbits.h 
b/arch/parisc/include/uapi/asm/termbits.h
index 40e920f8d683..28ce9067f2e0 100644
--- a/arch/parisc/include/uapi/asm/termbits.h
+++ b/arch/parisc/include/uapi/asm/termbits.h
@@ -2,201 +2,12 @@
 #ifndef __ARCH_PARISC_TERMBITS_H__
 #define __ARCH_PARISC_TERMBITS_H__
 
-#include 
+#include 
 
-typedef unsigned char  cc_t;
-typedef unsigned int   speed_t;
-typedef unsigned int   tcflag_t;
-
-#define NCCS 19
-struct termios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-};
-
-struct termios2 {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-struct ktermios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-/* c_cc characters */
-#define VINTR 0
-#define VQUIT 1
-#define VERASE 2
-#define VKILL 3
-#define VEOF 4
-#define VTIME 5
-#define VMIN 6
-#define VSWTC 7
-#define VSTART 8
-#define VSTOP 9
-#define VSUSP 10
-#define VEOL 11
-#define VREPRINT 12
-#define VDISCARD 13
-#define VWERASE 14
-#define VLNEXT 15
-#define VEOL2 16
-
-
-/* c_iflag bits */
-#define IGNBRK 001
-#define BRKINT 002
-#define IGNPAR 004
-#define PARMRK 010
-#define INPCK  020
-#define ISTRIP 040
-#define INLCR  100
-#define IGNCR  200
-#define ICRNL  400
-#define IUCLC  0001000
-#define IXON   0002000
-#define IXANY  0004000
-#define IXOFF  001
+/* c_iflag bits that differ from the generic ABI */
+#undef IMAXBEL
 #define IMAXBEL004
+#undef IUTF8
 #define IUTF8  010
 
-/* c_oflag bits */
-#define OPOST  001
-#define OLCUC  002
-#define ONLCR  004
-#define OCRNL  010
-#define ONOCR  020
-#define ONLRET 040
-#define OFILL  100
-#define OFDEL  200
-#define NLDLY  400
-#define   NL0  000
-#define   NL1  400
-#define CRDLY  0003000
-#define   CR0  000
-#define   CR1  0001000
-#define   CR2  0002000
-#define   CR3  0003000
-#define TABDLY 0014000
-#define   TAB0 000
-#define   TAB1 0004000
-#define   TAB2 001
-#define   TAB3 0014000
-#define   XTABS0014000
-#define BSDLY  002
-#define   BS0  000
-#define   BS1  002
-#define VTDLY  004
-#define   VT0  000
-#define   VT1  004
-#define FFDLY  010
-#define   FF0  000
-#define   FF1  010
-
-/* c_cflag bit meaning */
-#define CBAUD   0010017
-#define  B0 000 /* hang up */
-#define  B50001
-#define  B75002
-#define  B110   003
-#define  B134   004
-#define  B150   005
-#define  B200   006
-#define  B300   007
-#define  B600   010
-#define  B1200  011
-#define  B1800  012
-#define  B2400  013
-#define  B4800  014
-#define  B9600  015
-#define  B19200 016
-#define  B38400 017
-#define EXTA B19200
-#define EXTB B38400
-#define CSIZE   060
-#define   CS5   000
-#define   CS6   020
-#define   CS7   040
-#define   CS8   060
-#define CSTOPB  100
-#define CREAD   200
-#define PARENB  400
-#define PARODD  0001000
-#define HUPCL   0002000
-#define CLOCAL  0004000
-#define CBAUDEX 001
-#defineBOTHER 001
-#defineB57600 0010001
-#define   B115200 0010002
-#define   B230400 0010003
-#define   B460800 0010004
-#define   B50 0010005
-#define   B5760

[PATCH 1/5] asm-generic, termios: add alias constants from MIPS

2018-10-04 Thread H. Peter Anvin (Intel)
Some architectures, in this case MIPS, need a couple of legacy alias
constants for bits. There really is no reason why we can't define them
generically for all architectures.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
linux-kernel@vger.kernel.org (open list)
---
 include/uapi/asm-generic/termbits.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/termbits.h 
b/include/uapi/asm-generic/termbits.h
index 2fbaf9ae89dd..96ae175eec5b 100644
--- a/include/uapi/asm-generic/termbits.h
+++ b/include/uapi/asm-generic/termbits.h
@@ -8,7 +8,10 @@ typedef unsigned char  cc_t;
 typedef unsigned int   speed_t;
 typedef unsigned int   tcflag_t;
 
-#define NCCS 19
+#ifndef NCCS
+# define NCCS 19
+#endif
+
 struct termios {
tcflag_t c_iflag;   /* input mode flags */
tcflag_t c_oflag;   /* output mode flags */
@@ -49,6 +52,7 @@ struct ktermios {
 #define VTIME 5
 #define VMIN 6
 #define VSWTC 7
+#define VSWTCH VSWTC
 #define VSTART 8
 #define VSTOP 9
 #define VSUSP 10
@@ -173,6 +177,7 @@ struct ktermios {
 #define ECHONL 100
 #define NOFLSH 200
 #define TOSTOP 400
+#define ITOSTOP TOSTOP
 #define ECHOCTL0001000
 #define ECHOPRT0002000
 #define ECHOKE 0004000
-- 
2.14.4



[PATCH 2/5] arch/ia64, termios: use

2018-10-04 Thread H. Peter Anvin (Intel)
The IA64 definition of termbits.h is identical to the generic one.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Kate Stewart 
Cc: Philippe Ombredanne 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: 
Cc: Greg Kroah-Hartman 
CC: Jiri Slaby 
---
 arch/ia64/include/uapi/asm/termbits.h | 210 +-
 1 file changed, 1 insertion(+), 209 deletions(-)

diff --git a/arch/ia64/include/uapi/asm/termbits.h 
b/arch/ia64/include/uapi/asm/termbits.h
index 000a1a297c75..3935b106de79 100644
--- a/arch/ia64/include/uapi/asm/termbits.h
+++ b/arch/ia64/include/uapi/asm/termbits.h
@@ -1,209 +1 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-#ifndef _ASM_IA64_TERMBITS_H
-#define _ASM_IA64_TERMBITS_H
-
-/*
- * Based on .
- *
- * Modified 1999
- * David Mosberger-Tang , Hewlett-Packard Co
- *
- * 99/01/28Added new baudrates
- */
-
-#include 
-
-typedef unsigned char  cc_t;
-typedef unsigned int   speed_t;
-typedef unsigned int   tcflag_t;
-
-#define NCCS 19
-struct termios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-};
-
-struct termios2 {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-struct ktermios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-/* c_cc characters */
-#define VINTR 0
-#define VQUIT 1
-#define VERASE 2
-#define VKILL 3
-#define VEOF 4
-#define VTIME 5
-#define VMIN 6
-#define VSWTC 7
-#define VSTART 8
-#define VSTOP 9
-#define VSUSP 10
-#define VEOL 11
-#define VREPRINT 12
-#define VDISCARD 13
-#define VWERASE 14
-#define VLNEXT 15
-#define VEOL2 16
-
-/* c_iflag bits */
-#define IGNBRK 001
-#define BRKINT 002
-#define IGNPAR 004
-#define PARMRK 010
-#define INPCK  020
-#define ISTRIP 040
-#define INLCR  100
-#define IGNCR  200
-#define ICRNL  400
-#define IUCLC  0001000
-#define IXON   0002000
-#define IXANY  0004000
-#define IXOFF  001
-#define IMAXBEL002
-#define IUTF8  004
-
-/* c_oflag bits */
-#define OPOST  001
-#define OLCUC  002
-#define ONLCR  004
-#define OCRNL  010
-#define ONOCR  020
-#define ONLRET 040
-#define OFILL  100
-#define OFDEL  200
-#define NLDLY  400
-#define   NL0  000
-#define   NL1  400
-#define CRDLY  0003000
-#define   CR0  000
-#define   CR1  0001000
-#define   CR2  0002000
-#define   CR3  0003000
-#define TABDLY 0014000
-#define   TAB0 000
-#define   TAB1 0004000
-#define   TAB2 001
-#define   TAB3 0014000
-#define   XTABS0014000
-#define BSDLY  002
-#define   BS0  000
-#define   BS1  002
-#define VTDLY  004
-#define   VT0  000
-#define   VT1  004
-#define FFDLY  010
-#define   FF0  000
-#define   FF1  010
-
-/* c_cflag bit meaning */
-#define CBAUD  0010017
-#define  B0000 /* hang up */
-#define  B50   001
-#define  B75   002
-#define  B110  003
-#define  B134  004
-#define  B150  005
-#define  B200  006
-#define  B300  007
-#define  B600  010
-#define  B1200 011
-#define  B1800 012
-#define  B2400 013
-#define  B4800 014
-#define  B9600 015
-#define  B19200016
-#define  B38400017
-#define EXTA B19200
-#define EXTB B38400
-#define CSIZE  060
-#define   CS5  000
-#define   CS6  020
-#define   CS7  040
-#define   CS8  060
-#define CSTOPB 100
-#define CREAD  200
-#define PARENB 400
-#define PARODD 0001000
-#define HUPCL  0002000
-#define CLOCAL 0004000
-#define CBAUDEX 001
-#defineBOTHER 001
-#defineB57600 0010001
-#define   B115200 0010002
-#define   B230400 0010003
-#define   B460800 0010004
-#define   B50 0010005
-#define   B576000 0010006
-#define   B921600 0010007

[PATCH 5/5] arch/xtensa, termios: use

2018-10-04 Thread H. Peter Anvin (Intel)
The Xtensa definition of termbits.h is identical to the generic one.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Thomas Gleixner 
Cc: Greg Kroah-Hartman 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: 
Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
---
 arch/xtensa/include/uapi/asm/termbits.h | 222 +---
 1 file changed, 1 insertion(+), 221 deletions(-)

diff --git a/arch/xtensa/include/uapi/asm/termbits.h 
b/arch/xtensa/include/uapi/asm/termbits.h
index d4206a7c5138..3935b106de79 100644
--- a/arch/xtensa/include/uapi/asm/termbits.h
+++ b/arch/xtensa/include/uapi/asm/termbits.h
@@ -1,221 +1 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * include/asm-xtensa/termbits.h
- *
- * Copied from SH.
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License.  See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (C) 2001 - 2005 Tensilica Inc.
- */
-
-#ifndef _XTENSA_TERMBITS_H
-#define _XTENSA_TERMBITS_H
-
-
-#include 
-
-typedef unsigned char  cc_t;
-typedef unsigned int   speed_t;
-typedef unsigned int   tcflag_t;
-
-#define NCCS 19
-struct termios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-};
-
-struct termios2 {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-struct ktermios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-/* c_cc characters */
-
-#define VINTR 0
-#define VQUIT 1
-#define VERASE 2
-#define VKILL 3
-#define VEOF 4
-#define VTIME 5
-#define VMIN 6
-#define VSWTC 7
-#define VSTART 8
-#define VSTOP 9
-#define VSUSP 10
-#define VEOL 11
-#define VREPRINT 12
-#define VDISCARD 13
-#define VWERASE 14
-#define VLNEXT 15
-#define VEOL2 16
-
-/* c_iflag bits */
-
-#define IGNBRK 001
-#define BRKINT 002
-#define IGNPAR 004
-#define PARMRK 010
-#define INPCK  020
-#define ISTRIP 040
-#define INLCR  100
-#define IGNCR  200
-#define ICRNL  400
-#define IUCLC  0001000
-#define IXON   0002000
-#define IXANY  0004000
-#define IXOFF  001
-#define IMAXBEL002
-#define IUTF8  004
-
-/* c_oflag bits */
-
-#define OPOST  001
-#define OLCUC  002
-#define ONLCR  004
-#define OCRNL  010
-#define ONOCR  020
-#define ONLRET 040
-#define OFILL  100
-#define OFDEL  200
-#define NLDLY  400
-#define   NL0  000
-#define   NL1  400
-#define CRDLY  0003000
-#define   CR0  000
-#define   CR1  0001000
-#define   CR2  0002000
-#define   CR3  0003000
-#define TABDLY 0014000
-#define   TAB0 000
-#define   TAB1 0004000
-#define   TAB2 001
-#define   TAB3 0014000
-#define   XTABS0014000
-#define BSDLY  002
-#define   BS0  000
-#define   BS1  002
-#define VTDLY  004
-#define   VT0  000
-#define   VT1  004
-#define FFDLY  010
-#define   FF0  000
-#define   FF1  010
-
-/* c_cflag bit meaning */
-
-#define CBAUD  0010017
-#define  B0000 /* hang up */
-#define  B50   001
-#define  B75   002
-#define  B110  003
-#define  B134  004
-#define  B150  005
-#define  B200  006
-#define  B300  007
-#define  B600  010
-#define  B1200 011
-#define  B1800 012
-#define  B2400 013
-#define  B4800 014
-#define  B9600 015
-#define  B19200016
-#define  B38400017
-#define EXTA B19200
-#define EXTB B38400
-#define CSIZE  060
-#define   CS5  000
-#define   CS6  020
-#define   CS7  040
-#define   CS8  060
-#define CSTOPB 100
-#define CREAD  200
-#define PARENB 400
-#define PARODD 0001000
-#define HUPCL  0002000
-#define CLOCAL 0004000
-#define CBAUDEX 001
-#define   

[PATCH 0/5] termios: remove arch redundancy in

2018-10-04 Thread H. Peter Anvin (Intel)
 is one of those files which define an ABI. Some were
made different due to the desire to be compatible with legacy
architectures, others were mostly direct copies of the i386
definitions, which are now in asm-generic.

This folds the IA64, MIPS, PA-RISC, and Xtensa implementations into
the generic one.  IA64 and Xtensa are identical; MIPS and PA-RISC are
trivially different and just need a handful of constants redefined.

 has a few very minor adjustments to allow this.
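
The override pattern the remaining arch headers end up with looks roughly like
this hedged sketch (shape only; see the MIPS and PA-RISC patches for the real
contents):

/* Hedged sketch: an architecture header overriding a few constants while
 * pulling in the generic definitions.
 */
#define NCCS 23                         /* picked up by the new #ifndef guard */
#include <asm-generic/termbits.h>

#undef  VMIN                            /* redefine only what differs */
#define VMIN    4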

 arch/ia64/include/uapi/asm/termbits.h   | 210 +-
 arch/mips/include/uapi/asm/ioctls.h |   2 +
 arch/mips/include/uapi/asm/termbits.h   | 213 ++
 arch/parisc/include/uapi/asm/termbits.h | 197 +---
 arch/xtensa/include/uapi/asm/termbits.h | 222 +---
 include/uapi/asm-generic/termbits.h |   7 +-
 6 files changed, 27 insertions(+), 824 deletions(-)

Signed-off-by: H. Peter Anvin (Intel) 
Cc: "James E.J. Bottomley" 
Cc: Arnd Bergmann 
Cc: Chris Zankel 
Cc: Fenghua Yu 
Cc: Greg Kroah-Hartman 
Cc: Helge Deller 
Cc: James Hogan 
Cc: Jiri Slaby 
Cc: Kate Stewart 
Cc: Max Filippov 
Cc: Paul Burton 
Cc: Philippe Ombredanne 
Cc: Ralf Baechle 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: 
Cc: 
Cc: 
Cc: 


[PATCH 3/5] arch/mips, termios: use

2018-10-04 Thread H. Peter Anvin (Intel)
The MIPS definition of termbits.h is almost identical to the generic
one, so use the generic one and only redefine a handful of constants.

Move TIOCSER_TEMT to ioctls.h where it lives for all other
architectures.

Signed-off-by: H. Peter Anvin (Intel) 
Cc: Ralf Baechle 
Cc: Paul Burton 
Cc: James Hogan 
Cc: Philippe Ombredanne 
Cc: Greg Kroah-Hartman 
Cc: Kate Stewart 
Cc: Thomas Gleixner 
Cc: 
Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
---
 arch/mips/include/uapi/asm/ioctls.h   |   2 +
 arch/mips/include/uapi/asm/termbits.h | 213 +++---
 2 files changed, 15 insertions(+), 200 deletions(-)

diff --git a/arch/mips/include/uapi/asm/ioctls.h 
b/arch/mips/include/uapi/asm/ioctls.h
index 890245a9f0c4..a4e11e1438b5 100644
--- a/arch/mips/include/uapi/asm/ioctls.h
+++ b/arch/mips/include/uapi/asm/ioctls.h
@@ -114,4 +114,6 @@
 #define TIOCMIWAIT 0x5491 /* wait for a change on serial input line(s) */
 #define TIOCGICOUNT0x5492 /* read serial port inline interrupt counts */
 
+#define TIOCSER_TEMT   0x01/* Transmitter physically empty */
+
 #endif /* __ASM_IOCTLS_H */
diff --git a/arch/mips/include/uapi/asm/termbits.h 
b/arch/mips/include/uapi/asm/termbits.h
index dfeffba729b7..a08fa0a697f5 100644
--- a/arch/mips/include/uapi/asm/termbits.h
+++ b/arch/mips/include/uapi/asm/termbits.h
@@ -11,218 +11,31 @@
 #ifndef _ASM_TERMBITS_H
 #define _ASM_TERMBITS_H
 
-#include 
+#define NCCS 23
+#include 
 
-typedef unsigned char cc_t;
-typedef unsigned int speed_t;
-typedef unsigned int tcflag_t;
-
-/*
- * The ABI says nothing about NCC but seems to use NCCS as
- * replacement for it in struct termio
- */
-#define NCCS   23
-struct termios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-};
-
-struct termios2 {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-struct ktermios {
-   tcflag_t c_iflag;   /* input mode flags */
-   tcflag_t c_oflag;   /* output mode flags */
-   tcflag_t c_cflag;   /* control mode flags */
-   tcflag_t c_lflag;   /* local mode flags */
-   cc_t c_line;/* line discipline */
-   cc_t c_cc[NCCS];/* control characters */
-   speed_t c_ispeed;   /* input speed */
-   speed_t c_ospeed;   /* output speed */
-};
-
-/* c_cc characters */
-#define VINTR   0  /* Interrupt character [ISIG].  */
-#define VQUIT   1  /* Quit character [ISIG].  */
-#define VERASE  2  /* Erase character [ICANON].  */
-#define VKILL   3  /* Kill-line character [ICANON].  */
+/* c_cc characters that differ from the generic ABI */
+#undef VMIN
 #define VMIN4  /* Minimum number of bytes read at once 
[!ICANON].  */
-#define VTIME   5  /* Time-out value (tenths of a second) 
[!ICANON].  */
+#undef VEOL2
 #define VEOL2   6  /* Second EOL character [ICANON].  */
-#define VSWTC   7  /* ??? */
-#define VSWTCH VSWTC
-#define VSTART  8  /* Start (X-ON) character [IXON, 
IXOFF].  */
-#define VSTOP   9  /* Stop (X-OFF) character [IXON, 
IXOFF].  */
-#define VSUSP  10  /* Suspend character [ISIG].  */
-#if 0
-/*
- * VDSUSP is not supported
- */
-#define VDSUSP 11  /* Delayed suspend character [ISIG].  */
-#endif
-#define VREPRINT   12  /* Reprint-line character [ICANON].  */
-#define VDISCARD   13  /* Discard character [IEXTEN].  */
-#define VWERASE14  /* Word-erase character 
[ICANON].  */
-#define VLNEXT 15  /* Literal-next character [IEXTEN].  */
+#undef VEOF
 #define VEOF   16  /* End-of-file character [ICANON].  */
+#undef VEOL
 #define VEOL   17  /* End-of-line character [ICANON].  */
 
-/* c_iflag bits */
-#define IGNBRK 001 /* Ignore break condition.  */
-#define BRKINT 002 /* Signal interrupt on break.  */
-#define IGNPAR 004 /* Ignore characters with parity errors.  */
-#define

[tip:x86/pti] sched/smt: Make sched_smt_present track topology

2018-11-28 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  c5511d03ec090980732e929c318a7a6374b5550e
Gitweb: https://git.kernel.org/tip/c5511d03ec090980732e929c318a7a6374b5550e
Author: Peter Zijlstra (Intel) 
AuthorDate: Sun, 25 Nov 2018 19:33:36 +0100
Committer:  Thomas Gleixner 
CommitDate: Wed, 28 Nov 2018 11:57:06 +0100

sched/smt: Make sched_smt_present track topology

Currently the 'sched_smt_present' static key is enabled when SMT topology
is observed at CPU bringup, but it is never disabled. However there is demand
to also disable the key when the topology changes such that there is no SMT
present anymore.

Implement this by making the key count the number of cores that have SMT
enabled.

In particular, the SMT topology bits are set before interrupts are enabled
and, similarly, are cleared after interrupts are disabled for the last time
and the CPU dies.
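
A hedged sketch of what the counting key buys a consumer (the helper below is
illustrative and not part of this commit):

/* Hedged sketch: because static_branch_inc()/dec() keep a count of cores
 * with SMT up, a hot-path test like this compiles to a patched jump that
 * is only taken while at least one core actually has an online sibling.
 */
static inline bool smt_scheduling_possible(void)
{
        return static_branch_likely(&sched_smt_present);
}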

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Jiri Kosina 
Cc: Tom Lendacky 
Cc: Josh Poimboeuf 
Cc: Andrea Arcangeli 
Cc: David Woodhouse 
Cc: Tim Chen 
Cc: Andi Kleen 
Cc: Dave Hansen 
Cc: Casey Schaufler 
Cc: Asit Mallick 
Cc: Arjan van de Ven 
Cc: Jon Masters 
Cc: Waiman Long 
Cc: Greg KH 
Cc: Dave Stewart 
Cc: Kees Cook 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.246110...@linutronix.de


---
 kernel/sched/core.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 091e089063be..6fedf3a98581 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5738,15 +5738,10 @@ int sched_cpu_activate(unsigned int cpu)
 
 #ifdef CONFIG_SCHED_SMT
/*
-* The sched_smt_present static key needs to be evaluated on every
-* hotplug event because at boot time SMT might be disabled when
-* the number of booted CPUs is limited.
-*
-* If then later a sibling gets hotplugged, then the key would stay
-* off and SMT scheduling would never be functional.
+* When going up, increment the number of cores with SMT present.
 */
-   if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
-   static_branch_enable_cpuslocked(&sched_smt_present);
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   static_branch_inc_cpuslocked(&sched_smt_present);
 #endif
set_cpu_active(cpu, true);
 
@@ -5790,6 +5785,14 @@ int sched_cpu_deactivate(unsigned int cpu)
 */
synchronize_rcu_mult(call_rcu, call_rcu_sched);
 
+#ifdef CONFIG_SCHED_SMT
+   /*
+* When going down, decrement the number of cores with SMT present.
+*/
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   static_branch_dec_cpuslocked(&sched_smt_present);
+#endif
+
if (!sched_smp_initialized)
return 0;
 


[tip:x86/boot] x86/kaslr, ACPI/NUMA: Fix KASLR build error

2018-10-09 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  9d94e8b1d4f94a3c4cee5ad11a1be460cd070839
Gitweb: https://git.kernel.org/tip/9d94e8b1d4f94a3c4cee5ad11a1be460cd070839
Author: Peter Zijlstra (Intel) 
AuthorDate: Wed, 3 Oct 2018 14:41:27 +0200
Committer:  Borislav Petkov 
CommitDate: Tue, 9 Oct 2018 12:30:25 +0200

x86/kaslr, ACPI/NUMA: Fix KASLR build error

There is no point in trying to compile KASLR-specific code when there is
no KASLR.

 [ bp: Move the whole crap into kaslr.c and make
   rand_mem_physical_padding static. Make kaslr_check_padding()
   weak to avoid build breakage on other architectures. ]
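
The check that moves into kaslr_check_padding() boils down to the arithmetic
below; this is a hedged stand-alone sketch with made-up sizes, using a plain
round-up helper instead of the kernel's roundup():

#include <stdio.h>
#include <stdint.h>

#define TB_SHIFT 40

static uint64_t roundup_tb(uint64_t x)
{
        uint64_t tb = 1ULL << TB_SHIFT;
        return (x + tb - 1) / tb * tb;
}

int main(void)
{
        uint64_t max_actual_phys   = roundup_tb(3ULL << 39);  /* ~1.5 TiB installed */
        uint64_t max_possible_phys = roundup_tb(5ULL << 40);  /* 5 TiB hot-addable  */
        uint64_t padding_tb        = 1;                       /* boot-time padding  */
        uint64_t threshold = max_actual_phys + (padding_tb << TB_SHIFT);

        if (max_possible_phys > threshold)
                printf("need rand_mem_physical_padding=%llu\n",
                       (unsigned long long)((max_possible_phys - max_actual_phys) >> TB_SHIFT));
        return 0;
}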

Reported-by: Naresh Kamboju 
Reported-by: Mark Brown 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Borislav Petkov 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Link: 
http://lkml.kernel.org/r/20181003123402.ga15...@hirez.programming.kicks-ass.net
---
 arch/x86/include/asm/setup.h |  2 --
 arch/x86/mm/kaslr.c  | 19 ++-
 drivers/acpi/numa.c  | 17 +
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 65a5bf8f6aba..ae13bc974416 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -80,8 +80,6 @@ static inline unsigned long kaslr_offset(void)
return (unsigned long)&_text - __START_KERNEL;
 }
 
-extern int rand_mem_physical_padding;
-
 /*
  * Do NOT EVER look at the BIOS memory size location.
  * It does not work on many machines.
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 00cf4cae38f5..b3471388288d 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,7 +41,7 @@
  */
 static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;
 
-int __initdata rand_mem_physical_padding = 
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
+static int __initdata rand_mem_physical_padding = 
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
  * earlier during boot). The list is ordered based on virtual addresses. This
@@ -70,6 +71,22 @@ static inline bool kaslr_memory_enabled(void)
return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
 }
 
+/*
+ * Check the padding size for KASLR is enough.
+ */
+void __init kaslr_check_padding(void)
+{
+   u64 max_possible_phys, max_actual_phys, threshold;
+
+   max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40);
+   max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40);
+   threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40);
+
+   if (max_possible_phys > threshold)
+   pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory 
hotadd failure.\n",
+   (max_possible_phys - max_actual_phys) >> 40);
+}
+
 static int __init rand_mem_physical_padding_setup(char *str)
 {
int max_padding = (1 << (MAX_PHYSMEM_BITS - TB_SHIFT)) - 1;
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 3d69834c692f..ba62004f4d86 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -32,7 +32,6 @@
 #include 
 #include 
 #include 
-#include 
 
 static nodemask_t nodes_found_map = NODE_MASK_NONE;
 
@@ -433,10 +432,12 @@ acpi_table_parse_srat(enum acpi_srat_type id,
handler, max_entries);
 }
 
+/* To be overridden by architectures */
+void __init __weak kaslr_check_padding(void) { }
+
 int __init acpi_numa_init(void)
 {
int cnt = 0;
-   u64 max_possible_phys, max_actual_phys, threshold;
 
if (acpi_disabled)
return -EINVAL;
@@ -466,17 +467,9 @@ int __init acpi_numa_init(void)
cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
acpi_parse_memory_affinity, 0);
 
-   /* check the padding size for KASLR is enough. */
-   if (parsed_numa_memblks && kaslr_enabled()) {
-   max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 
40);
-   max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 
1ULL << 40);
-   threshold = max_actual_phys + 
((u64)rand_mem_physical_padding << 40);
+   if (parsed_numa_memblks)
+   kaslr_check_padding();
 
-   if (max_possible_phys > threshold) {
-   pr_warn("Set 'rand_mem_physical_padding=%llu' 
to avoid memory hotadd failure.\n",
- (max_possible_phys - max_actual_phys) >> 40);
-   }
-   }
}
 
/* SLIT: System Locality Information Table */


[tip:perf/core] perf: Avoid horrible stack usage

2015-01-14 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  86038c5ea81b519a8a1fcfcd5e4599aab0cdd119
Gitweb: http://git.kernel.org/tip/86038c5ea81b519a8a1fcfcd5e4599aab0cdd119
Author: Peter Zijlstra (Intel) 
AuthorDate: Tue, 16 Dec 2014 12:47:34 +0100
Committer:  Ingo Molnar 
CommitDate: Wed, 14 Jan 2015 15:11:45 +0100

perf: Avoid horrible stack usage

Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.

The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.

Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.

This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.

Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.

We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to assertain which __perf_regs
element to use).

One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.
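
As a rough illustration of the approach (not the kernel API; the names below
are invented and a static array stands in for per-CPU data): the scratch
structure lives in static storage indexed by nesting level, so the hot path
never puts a pt_regs-sized object on the stack.

#include <stdio.h>
#include <string.h>

struct big_regs { unsigned long r[32]; };       /* stand-in for pt_regs */

static struct big_regs scratch[4];              /* one slot per nesting level */
static int depth;                               /* current nesting level */

static void report_event(int level)
{
        struct big_regs *regs = &scratch[level];    /* no stack bloat here */
        memset(regs, 0, sizeof(*regs));
        printf("event reported at nesting level %d\n", level);
}

static void software_event(void)
{
        int level = depth++;                    /* enter one nesting level */
        report_event(level);
        depth--;                                /* leave it again */
}

int main(void)
{
        software_event();
        return 0;
}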

Reported-by: Linus Torvalds 
Reported-by: Steven Rostedt 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Javi Merino 
Cc: Linus Torvalds 
Cc: Mathieu Desnoyers 
Cc: Oleg Nesterov 
Cc: Paul Mackerras 
Cc: Petr Mladek 
Cc: Steven Rostedt 
Cc: Tom Zanussi 
Cc: Vaibhav Nagarnaik 
Link: 
http://lkml.kernel.org/r/20141216115041.gw3...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
---
 include/linux/ftrace_event.h|  2 +-
 include/linux/perf_event.h  | 28 +---
 include/trace/ftrace.h  |  7 ---
 kernel/events/core.c| 23 +--
 kernel/sched/core.c |  2 +-
 kernel/trace/trace_event_perf.c |  4 +++-
 kernel/trace/trace_kprobe.c |  4 ++--
 kernel/trace/trace_syscalls.c   |  4 ++--
 kernel/trace/trace_uprobe.c |  2 +-
 9 files changed, 52 insertions(+), 24 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 0bebb5c..d36f68b 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -595,7 +595,7 @@ extern int  ftrace_profile_set_filter(struct perf_event 
*event, int event_id,
 char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
 extern void *perf_trace_buf_prepare(int size, unsigned short type,
-   struct pt_regs *regs, int *rctxp);
+   struct pt_regs **regs, int *rctxp);
 
 static inline void
 perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4f7a61c..3a7bd80 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -665,6 +665,7 @@ static inline int is_software_event(struct perf_event 
*event)
 
 extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
 
+extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
 extern void __perf_sw_event(u32, u64, struct pt_regs *, u64);
 
 #ifndef perf_arch_fetch_caller_regs
@@ -689,14 +690,25 @@ static inline void perf_fetch_caller_regs(struct pt_regs 
*regs)
 static __always_inline void
 perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 {
-   struct pt_regs hot_regs;
+   if (static_key_false(&perf_swevent_enabled[event_id]))
+   __perf_sw_event(event_id, nr, regs, addr);
+}
+
+DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]);
 
+/*
+ * 'Special' version for the scheduler, it hard assumes no recursion,
+ * which is guaranteed by us not actually scheduling inside other swevents
+ * because those disable preemption.
+ */
+static __always_inline void
+perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
+{
if (static_key_false(&perf_swevent_enabled[event_id])) {
-   if (!regs) {
-   perf_fetch_caller_regs(&hot_regs);
-   regs = &hot_regs;
-   }
-   __perf_sw_event(event_id, nr, regs, addr);
+   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
+
+   perf_fetch_caller_regs(regs);
+   ___pe

[tip: sched/core] preempt/dynamic: Provide irqentry_exit_cond_resched() static call

2021-02-17 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 40607ee97e4eec5655cc0f76a720bdc4c63a6434
Gitweb:
https://git.kernel.org/tip/40607ee97e4eec5655cc0f76a720bdc4c63a6434
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:22 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

preempt/dynamic: Provide irqentry_exit_cond_resched() static call

Provide static call to control IRQ preemption (called in CONFIG_PREEMPT)
so that we can override its behaviour when preempt= is overridden.

Since the default behaviour is full preemption, its call is
initialized to provide IRQ preemption when preempt= isn't passed.
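
A hedged user-space sketch of the pattern, with a plain function pointer
standing in for the static call (real static calls patch the call site
directly and avoid this indirection); the function names are illustrative:

#include <stdio.h>

static void cond_resched_real(void) { puts("rescheduling on IRQ exit"); }
static void cond_resched_nop(void)  { /* preempt=none or voluntary */ }

/* stand-in for DEFINE_STATIC_CALL(irqentry_exit_cond_resched, ...) */
static void (*irqentry_exit_cond_resched_call)(void) = cond_resched_real;

static void irqentry_exit(void)
{
        irqentry_exit_cond_resched_call();      /* stand-in for static_call(...)() */
}

int main(void)
{
        irqentry_exit();                                /* default: full preemption */
        irqentry_exit_cond_resched_call = cond_resched_nop;  /* "preempt=none" */
        irqentry_exit();                                /* now does nothing */
        return 0;
}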

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210118141223.123667-8-frede...@kernel.org
---
 include/linux/entry-common.h |  4 
 kernel/entry/common.c| 10 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index a104b29..883acef 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_ENTRYCOMMON_H
 #define __LINUX_ENTRYCOMMON_H
 
+#include 
 #include 
 #include 
 #include 
@@ -454,6 +455,9 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs 
*regs);
  * Conditional reschedule with additional sanity checks.
  */
 void irqentry_exit_cond_resched(void);
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+#endif
 
 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f9d491b..f09cae3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -385,6 +385,9 @@ void irqentry_exit_cond_resched(void)
preempt_schedule_irq();
}
 }
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+#endif
 
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
@@ -411,8 +414,13 @@ noinstr void irqentry_exit(struct pt_regs *regs, 
irqentry_state_t state)
}
 
instrumentation_begin();
-   if (IS_ENABLED(CONFIG_PREEMPTION))
+   if (IS_ENABLED(CONFIG_PREEMPTION)) {
+#ifdef CONFIG_PREEMPT_DYNAMIC
+   static_call(irqentry_exit_cond_resched)();
+#else
irqentry_exit_cond_resched();
+#endif
+   }
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();


[tip: sched/core] preempt/dynamic: Provide preempt_schedule[_notrace]() static calls

2021-02-17 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 2c9a98d3bc808717ab63ad928a2b568967775388
Gitweb:
https://git.kernel.org/tip/2c9a98d3bc808717ab63ad928a2b568967775388
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:21 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

preempt/dynamic: Provide preempt_schedule[_notrace]() static calls

Provide static calls to control preempt_schedule[_notrace]()
(called in CONFIG_PREEMPT) so that we can override their behaviour when
preempt= is overridden.

Since the default behaviour is full preemption, both their calls are
initialized to the arch provided wrapper, if any.

[fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less
   dependent on x86 with __preempt_schedule_func]
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210118141223.123667-7-frede...@kernel.org
---
 arch/x86/include/asm/preempt.h | 34 +
 kernel/sched/core.c| 12 -
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 69485ca..9b12dce 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_PER_CPU(int, __preempt_count);
 
@@ -103,16 +104,33 @@ static __always_inline bool should_resched(int 
preempt_offset)
 }
 
 #ifdef CONFIG_PREEMPTION
-  extern asmlinkage void preempt_schedule_thunk(void);
-# define __preempt_schedule() \
-   asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT)
 
-  extern asmlinkage void preempt_schedule(void);
-  extern asmlinkage void preempt_schedule_notrace_thunk(void);
-# define __preempt_schedule_notrace() \
-   asm volatile ("call preempt_schedule_notrace_thunk" : 
ASM_CALL_CONSTRAINT)
+extern asmlinkage void preempt_schedule(void);
+extern asmlinkage void preempt_schedule_thunk(void);
+
+#define __preempt_schedule_func preempt_schedule_thunk
+
+DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
+
+#define __preempt_schedule() \
+do { \
+   __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule)); \
+   asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : 
ASM_CALL_CONSTRAINT); \
+} while (0)
+
+extern asmlinkage void preempt_schedule_notrace(void);
+extern asmlinkage void preempt_schedule_notrace_thunk(void);
+
+#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
+
+DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+
+#define __preempt_schedule_notrace() \
+do { \
+   __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule_notrace)); \
+   asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : 
ASM_CALL_CONSTRAINT); \
+} while (0)
 
-  extern asmlinkage void preempt_schedule_notrace(void);
 #endif
 
 #endif /* __ASM_PREEMPT_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f7c8fd8..880611c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5265,6 +5265,12 @@ asmlinkage __visible void __sched notrace 
preempt_schedule(void)
 NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
+EXPORT_STATIC_CALL(preempt_schedule);
+#endif
+
+
 /**
  * preempt_schedule_notrace - preempt_schedule called by tracing
  *
@@ -5317,6 +5323,12 @@ asmlinkage __visible void __sched notrace 
preempt_schedule_notrace(void)
 }
 EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+EXPORT_STATIC_CALL(preempt_schedule_notrace);
+#endif
+
+
 #endif /* CONFIG_PREEMPTION */
 
 /*


[tip: sched/core] preempt/dynamic: Support dynamic preempt with preempt= boot option

2021-02-17 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 826bfeb37bb4302ee6042f330c4c0c757152bdb8
Gitweb:
https://git.kernel.org/tip/826bfeb37bb4302ee6042f330c4c0c757152bdb8
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:23 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

preempt/dynamic: Support dynamic preempt with preempt= boot option

Support the preempt= boot option and patch the static call sites
accordingly.
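
A minimal sketch of just the option handling (invented names, a string in
place of the real static-call updates): unknown strings only warn and leave
the default alone, mirroring the pr_warn() branch in the patch.

#include <stdio.h>
#include <string.h>

static const char *mode = "full";           /* default when preempt= is absent */

static int setup_preempt_mode(const char *str)
{
        if (!strcmp(str, "none") || !strcmp(str, "voluntary") || !strcmp(str, "full")) {
                mode = str;
                printf("Dynamic Preempt: %s\n", str);
                return 0;
        }
        printf("Dynamic Preempt: Unsupported preempt mode %s, default to full\n", str);
        return 1;
}

int main(void)
{
        setup_preempt_mode("lazy");         /* unknown: warns, mode stays "full" */
        setup_preempt_mode("voluntary");    /* valid: switches the mode */
        printf("selected mode: %s\n", mode);
        return 0;
}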

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210118141223.123667-9-frede...@kernel.org
---
 kernel/sched/core.c | 68 +++-
 1 file changed, 67 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 880611c..0c06717 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5328,9 +5328,75 @@ DEFINE_STATIC_CALL(preempt_schedule_notrace, 
__preempt_schedule_notrace_func);
 EXPORT_STATIC_CALL(preempt_schedule_notrace);
 #endif
 
-
 #endif /* CONFIG_PREEMPTION */
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+#include 
+
+/*
+ * SC:cond_resched
+ * SC:might_resched
+ * SC:preempt_schedule
+ * SC:preempt_schedule_notrace
+ * SC:irqentry_exit_cond_resched
+ *
+ *
+ * NONE:
+ *   cond_resched   <- __cond_resched
+ *   might_resched  <- RET0
+ *   preempt_schedule   <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * VOLUNTARY:
+ *   cond_resched   <- __cond_resched
+ *   might_resched  <- __cond_resched
+ *   preempt_schedule   <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * FULL:
+ *   cond_resched   <- RET0
+ *   might_resched  <- RET0
+ *   preempt_schedule   <- preempt_schedule
+ *   preempt_schedule_notrace   <- preempt_schedule_notrace
+ *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ */
+static int __init setup_preempt_mode(char *str)
+{
+   if (!strcmp(str, "none")) {
+   static_call_update(cond_resched, __cond_resched);
+   static_call_update(might_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(preempt_schedule, 
(typeof(&preempt_schedule)) NULL);
+   static_call_update(preempt_schedule_notrace, 
(typeof(&preempt_schedule_notrace)) NULL);
+   static_call_update(irqentry_exit_cond_resched, 
(typeof(&irqentry_exit_cond_resched)) NULL);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else if (!strcmp(str, "voluntary")) {
+   static_call_update(cond_resched, __cond_resched);
+   static_call_update(might_resched, __cond_resched);
+   static_call_update(preempt_schedule, 
(typeof(&preempt_schedule)) NULL);
+   static_call_update(preempt_schedule_notrace, 
(typeof(&preempt_schedule_notrace)) NULL);
+   static_call_update(irqentry_exit_cond_resched, 
(typeof(&irqentry_exit_cond_resched)) NULL);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else if (!strcmp(str, "full")) {
+   static_call_update(cond_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(might_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(preempt_schedule, __preempt_schedule_func);
+   static_call_update(preempt_schedule_notrace, 
__preempt_schedule_notrace_func);
+   static_call_update(irqentry_exit_cond_resched, 
irqentry_exit_cond_resched);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else {
+   pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default 
to full\n", str);
+   return 1;
+   }
+   return 0;
+}
+__setup("preempt=", setup_preempt_mode);
+
+#endif /* CONFIG_PREEMPT_DYNAMIC */
+
+
 /*
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.


[tip: sched/core] preempt/dynamic: Provide cond_resched() and might_resched() static calls

2021-02-17 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: b965f1ddb47daa5b8b2e2bc9c921431236830367
Gitweb:
https://git.kernel.org/tip/b965f1ddb47daa5b8b2e2bc9c921431236830367
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:20 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

preempt/dynamic: Provide cond_resched() and might_resched() static calls

Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT)
and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) so that we
can override their behaviour when preempt= is overridden.

Since the default behaviour is full preemption, both their calls are
ignored when preempt= isn't passed.

  [fweisbec: branch might_resched() directly to __cond_resched(), only
 define static calls when PREEMPT_DYNAMIC]
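
For the RET0 default specifically, a hedged function-pointer sketch (the
kernel's DEFINE_STATIC_CALL_RET0 patches in a "return 0" trampoline instead):
under full preemption cond_resched() degrades to returning 0, and only the
other modes point it back at the real helper.

#include <stdio.h>

static int cond_resched_real(void) { puts("voluntarily rescheduling"); return 1; }
static int cond_resched_ret0(void) { return 0; }   /* DEFINE_STATIC_CALL_RET0 analogue */

static int (*cond_resched_call)(void) = cond_resched_ret0;  /* default: full preemption */

int main(void)
{
        printf("full:      %d\n", cond_resched_call());
        cond_resched_call = cond_resched_real;              /* "preempt=voluntary" */
        printf("voluntary: %d\n", cond_resched_call());
        return 0;
}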

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210118141223.123667-6-frede...@kernel.org
---
 include/linux/kernel.h | 23 +++
 include/linux/sched.h  | 27 ---
 kernel/sched/core.c| 16 +---
 3 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index f7902d8..cfd3d34 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -15,7 +15,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 #include 
@@ -81,11 +81,26 @@ struct pt_regs;
 struct user;
 
 #ifdef CONFIG_PREEMPT_VOLUNTARY
-extern int _cond_resched(void);
-# define might_resched() _cond_resched()
+
+extern int __cond_resched(void);
+# define might_resched() __cond_resched()
+
+#elif defined(CONFIG_PREEMPT_DYNAMIC)
+
+extern int __cond_resched(void);
+
+DECLARE_STATIC_CALL(might_resched, __cond_resched);
+
+static __always_inline void might_resched(void)
+{
+   static_call(might_resched)();
+}
+
 #else
+
 # define might_resched() do { } while (0)
-#endif
+
+#endif /* CONFIG_PREEMPT_* */
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 extern void ___might_sleep(const char *file, int line, int preempt_offset);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e115222..2f35594 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1871,11 +1871,32 @@ static inline int test_tsk_need_resched(struct 
task_struct *tsk)
  * value indicates whether a reschedule was done in fact.
  * cond_resched_lock() will drop the spinlock before scheduling,
  */
-#ifndef CONFIG_PREEMPTION
-extern int _cond_resched(void);
+#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
+extern int __cond_resched(void);
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+DECLARE_STATIC_CALL(cond_resched, __cond_resched);
+
+static __always_inline int _cond_resched(void)
+{
+   return static_call(cond_resched)();
+}
+
 #else
+
+static inline int _cond_resched(void)
+{
+   return __cond_resched();
+}
+
+#endif /* CONFIG_PREEMPT_DYNAMIC */
+
+#else
+
 static inline int _cond_resched(void) { return 0; }
-#endif
+
+#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
 
 #define cond_resched() ({  \
___might_sleep(__FILE__, __LINE__, 0);  \
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4afbdd2..f7c8fd8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6785,17 +6785,27 @@ SYSCALL_DEFINE0(sched_yield)
return 0;
 }
 
-#ifndef CONFIG_PREEMPTION
-int __sched _cond_resched(void)
+#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
+int __sched __cond_resched(void)
 {
if (should_resched(0)) {
preempt_schedule_common();
return 1;
}
+#ifndef CONFIG_PREEMPT_RCU
rcu_all_qs();
+#endif
return 0;
 }
-EXPORT_SYMBOL(_cond_resched);
+EXPORT_SYMBOL(__cond_resched);
+#endif
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
+EXPORT_STATIC_CALL(cond_resched);
+
+DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
+EXPORT_STATIC_CALL(might_resched);
 #endif
 
 /*


[tip: sched/core] preempt/dynamic: Support dynamic preempt with preempt= boot option

2021-02-09 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0e79823f55de3cff95894fbb40440b17910e7378
Gitweb:
https://git.kernel.org/tip/0e79823f55de3cff95894fbb40440b17910e7378
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:23 +01:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 09 Feb 2021 16:30:58 +01:00

preempt/dynamic: Support dynamic preempt with preempt= boot option

Support the preempt= boot option and patch the static call sites
accordingly.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210118141223.123667-9-frede...@kernel.org
---
 kernel/sched/core.c | 68 +++-
 1 file changed, 67 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cd0c46f..220393d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5314,9 +5314,75 @@ DEFINE_STATIC_CALL(preempt_schedule_notrace, 
__preempt_schedule_notrace_func);
 EXPORT_STATIC_CALL(preempt_schedule_notrace);
 #endif
 
-
 #endif /* CONFIG_PREEMPTION */
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+#include 
+
+/*
+ * SC:cond_resched
+ * SC:might_resched
+ * SC:preempt_schedule
+ * SC:preempt_schedule_notrace
+ * SC:irqentry_exit_cond_resched
+ *
+ *
+ * NONE:
+ *   cond_resched   <- __cond_resched
+ *   might_resched  <- RET0
+ *   preempt_schedule   <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * VOLUNTARY:
+ *   cond_resched   <- __cond_resched
+ *   might_resched  <- __cond_resched
+ *   preempt_schedule   <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * FULL:
+ *   cond_resched   <- RET0
+ *   might_resched  <- RET0
+ *   preempt_schedule   <- preempt_schedule
+ *   preempt_schedule_notrace   <- preempt_schedule_notrace
+ *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ */
+static int __init setup_preempt_mode(char *str)
+{
+   if (!strcmp(str, "none")) {
+   static_call_update(cond_resched, __cond_resched);
+   static_call_update(might_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(preempt_schedule, 
(typeof(&preempt_schedule)) NULL);
+   static_call_update(preempt_schedule_notrace, 
(typeof(&preempt_schedule_notrace)) NULL);
+   static_call_update(irqentry_exit_cond_resched, 
(typeof(&irqentry_exit_cond_resched)) NULL);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else if (!strcmp(str, "voluntary")) {
+   static_call_update(cond_resched, __cond_resched);
+   static_call_update(might_resched, __cond_resched);
+   static_call_update(preempt_schedule, 
(typeof(&preempt_schedule)) NULL);
+   static_call_update(preempt_schedule_notrace, 
(typeof(&preempt_schedule_notrace)) NULL);
+   static_call_update(irqentry_exit_cond_resched, 
(typeof(&irqentry_exit_cond_resched)) NULL);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else if (!strcmp(str, "full")) {
+   static_call_update(cond_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(might_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(preempt_schedule, __preempt_schedule_func);
+   static_call_update(preempt_schedule_notrace, 
__preempt_schedule_notrace_func);
+   static_call_update(irqentry_exit_cond_resched, 
irqentry_exit_cond_resched);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else {
+   pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default 
to full\n", str);
+   return 1;
+   }
+   return 0;
+}
+__setup("preempt=", setup_preempt_mode);
+
+#endif /* CONFIG_PREEMPT_DYNAMIC */
+
+
 /*
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.


[tip: sched/core] preempt/dynamic: Provide preempt_schedule[_notrace]() static calls

2021-02-08 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8c98e8cf723c3ab2ac924b0942dd3b8074f874e5
Gitweb:
https://git.kernel.org/tip/8c98e8cf723c3ab2ac924b0942dd3b8074f874e5
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:21 +01:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 05 Feb 2021 17:19:57 +01:00

preempt/dynamic: Provide preempt_schedule[_notrace]() static calls

Provide static calls to control preempt_schedule[_notrace]()
(called in CONFIG_PREEMPT) so that we can override their behaviour when
preempt= is overridden.

Since the default behaviour is full preemption, both their calls are
initialized to the arch provided wrapper, if any.

[fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less
   dependent on x86 with __preempt_schedule_func]
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210118141223.123667-7-frede...@kernel.org
---
 arch/x86/include/asm/preempt.h | 34 +
 kernel/sched/core.c| 12 -
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 69485ca..9b12dce 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_PER_CPU(int, __preempt_count);
 
@@ -103,16 +104,33 @@ static __always_inline bool should_resched(int 
preempt_offset)
 }
 
 #ifdef CONFIG_PREEMPTION
-  extern asmlinkage void preempt_schedule_thunk(void);
-# define __preempt_schedule() \
-   asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT)
 
-  extern asmlinkage void preempt_schedule(void);
-  extern asmlinkage void preempt_schedule_notrace_thunk(void);
-# define __preempt_schedule_notrace() \
-   asm volatile ("call preempt_schedule_notrace_thunk" : 
ASM_CALL_CONSTRAINT)
+extern asmlinkage void preempt_schedule(void);
+extern asmlinkage void preempt_schedule_thunk(void);
+
+#define __preempt_schedule_func preempt_schedule_thunk
+
+DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
+
+#define __preempt_schedule() \
+do { \
+   __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule)); \
+   asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : 
ASM_CALL_CONSTRAINT); \
+} while (0)
+
+extern asmlinkage void preempt_schedule_notrace(void);
+extern asmlinkage void preempt_schedule_notrace_thunk(void);
+
+#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
+
+DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+
+#define __preempt_schedule_notrace() \
+do { \
+   __ADDRESSABLE(STATIC_CALL_KEY(preempt_schedule_notrace)); \
+   asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : 
ASM_CALL_CONSTRAINT); \
+} while (0)
 
-  extern asmlinkage void preempt_schedule_notrace(void);
 #endif
 
 #endif /* __ASM_PREEMPT_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bc7b00b..cd0c46f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5251,6 +5251,12 @@ asmlinkage __visible void __sched notrace 
preempt_schedule(void)
 NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
+EXPORT_STATIC_CALL(preempt_schedule);
+#endif
+
+
 /**
  * preempt_schedule_notrace - preempt_schedule called by tracing
  *
@@ -5303,6 +5309,12 @@ asmlinkage __visible void __sched notrace 
preempt_schedule_notrace(void)
 }
 EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+EXPORT_STATIC_CALL(preempt_schedule_notrace);
+#endif
+
+
 #endif /* CONFIG_PREEMPTION */
 
 /*


[tip: sched/core] preempt/dynamic: Provide cond_resched() and might_resched() static calls

2021-02-08 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: bf3054bb801cf566e65e5f3d060435dbfa4a2f36
Gitweb:
https://git.kernel.org/tip/bf3054bb801cf566e65e5f3d060435dbfa4a2f36
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:20 +01:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 05 Feb 2021 17:19:56 +01:00

preempt/dynamic: Provide cond_resched() and might_resched() static calls

Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT)
and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) so that we
can override their behaviour when preempt= is overridden.

Since the default behaviour is full preemption, both their calls are
ignored when preempt= isn't passed.

[fweisbec: branch might_resched() directly to __cond_resched(), only
   define static calls when PREEMPT_DYNAMIC]
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210118141223.123667-6-frede...@kernel.org
---
 include/linux/kernel.h | 23 +++
 include/linux/sched.h  | 27 ---
 kernel/sched/core.c| 16 +---
 3 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index f7902d8..cfd3d34 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -15,7 +15,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 #include 
@@ -81,11 +81,26 @@ struct pt_regs;
 struct user;
 
 #ifdef CONFIG_PREEMPT_VOLUNTARY
-extern int _cond_resched(void);
-# define might_resched() _cond_resched()
+
+extern int __cond_resched(void);
+# define might_resched() __cond_resched()
+
+#elif defined(CONFIG_PREEMPT_DYNAMIC)
+
+extern int __cond_resched(void);
+
+DECLARE_STATIC_CALL(might_resched, __cond_resched);
+
+static __always_inline void might_resched(void)
+{
+   static_call(might_resched)();
+}
+
 #else
+
 # define might_resched() do { } while (0)
-#endif
+
+#endif /* CONFIG_PREEMPT_* */
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 extern void ___might_sleep(const char *file, int line, int preempt_offset);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e115222..2f35594 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1871,11 +1871,32 @@ static inline int test_tsk_need_resched(struct 
task_struct *tsk)
  * value indicates whether a reschedule was done in fact.
  * cond_resched_lock() will drop the spinlock before scheduling,
  */
-#ifndef CONFIG_PREEMPTION
-extern int _cond_resched(void);
+#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
+extern int __cond_resched(void);
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+DECLARE_STATIC_CALL(cond_resched, __cond_resched);
+
+static __always_inline int _cond_resched(void)
+{
+   return static_call(cond_resched)();
+}
+
 #else
+
+static inline int _cond_resched(void)
+{
+   return __cond_resched();
+}
+
+#endif /* CONFIG_PREEMPT_DYNAMIC */
+
+#else
+
 static inline int _cond_resched(void) { return 0; }
-#endif
+
+#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
 
 #define cond_resched() ({  \
___might_sleep(__FILE__, __LINE__, 0);  \
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be3a956..bc7b00b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6771,17 +6771,27 @@ SYSCALL_DEFINE0(sched_yield)
return 0;
 }
 
-#ifndef CONFIG_PREEMPTION
-int __sched _cond_resched(void)
+#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
+int __sched __cond_resched(void)
 {
if (should_resched(0)) {
preempt_schedule_common();
return 1;
}
+#ifndef CONFIG_PREEMPT_RCU
rcu_all_qs();
+#endif
return 0;
 }
-EXPORT_SYMBOL(_cond_resched);
+EXPORT_SYMBOL(__cond_resched);
+#endif
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
+EXPORT_STATIC_CALL(cond_resched);
+
+DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
+EXPORT_STATIC_CALL(might_resched);
 #endif
 
 /*


[tip: sched/core] preempt/dynamic: Support dynamic preempt with preempt= boot option

2021-02-08 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 4f8a0041162ab74ba5f89ee58f8aad59c4c85d22
Gitweb:
https://git.kernel.org/tip/4f8a0041162ab74ba5f89ee58f8aad59c4c85d22
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:23 +01:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 05 Feb 2021 17:19:58 +01:00

preempt/dynamic: Support dynamic preempt with preempt= boot option

Support the preempt= boot option and patch the static call sites
accordingly.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210118141223.123667-9-frede...@kernel.org
---
 kernel/sched/core.c | 67 +++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cd0c46f..0562992 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -12,6 +12,7 @@
 
 #include "sched.h"
 
+#include 
 #include 
 
 #include 
@@ -5314,9 +5315,73 @@ DEFINE_STATIC_CALL(preempt_schedule_notrace, 
__preempt_schedule_notrace_func);
 EXPORT_STATIC_CALL(preempt_schedule_notrace);
 #endif
 
-
 #endif /* CONFIG_PREEMPTION */
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+/*
+ * SC:cond_resched
+ * SC:might_resched
+ * SC:preempt_schedule
+ * SC:preempt_schedule_notrace
+ * SC:irqentry_exit_cond_resched
+ *
+ *
+ * NONE:
+ *   cond_resched   <- __cond_resched
+ *   might_resched  <- RET0
+ *   preempt_schedule   <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * VOLUNTARY:
+ *   cond_resched   <- __cond_resched
+ *   might_resched  <- __cond_resched
+ *   preempt_schedule   <- NOP
+ *   preempt_schedule_notrace   <- NOP
+ *   irqentry_exit_cond_resched <- NOP
+ *
+ * FULL:
+ *   cond_resched   <- RET0
+ *   might_resched  <- RET0
+ *   preempt_schedule   <- preempt_schedule
+ *   preempt_schedule_notrace   <- preempt_schedule_notrace
+ *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ */
+static int __init setup_preempt_mode(char *str)
+{
+   if (!strcmp(str, "none")) {
+   static_call_update(cond_resched, __cond_resched);
+   static_call_update(might_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(preempt_schedule, 
(typeof(&preempt_schedule)) NULL);
+   static_call_update(preempt_schedule_notrace, 
(typeof(&preempt_schedule_notrace)) NULL);
+   static_call_update(irqentry_exit_cond_resched, 
(typeof(&irqentry_exit_cond_resched)) NULL);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else if (!strcmp(str, "voluntary")) {
+   static_call_update(cond_resched, __cond_resched);
+   static_call_update(might_resched, __cond_resched);
+   static_call_update(preempt_schedule, 
(typeof(&preempt_schedule)) NULL);
+   static_call_update(preempt_schedule_notrace, 
(typeof(&preempt_schedule_notrace)) NULL);
+   static_call_update(irqentry_exit_cond_resched, 
(typeof(&irqentry_exit_cond_resched)) NULL);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else if (!strcmp(str, "full")) {
+   static_call_update(cond_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(might_resched, (typeof(&__cond_resched)) 
__static_call_return0);
+   static_call_update(preempt_schedule, __preempt_schedule_func);
+   static_call_update(preempt_schedule_notrace, 
__preempt_schedule_notrace_func);
+   static_call_update(irqentry_exit_cond_resched, 
irqentry_exit_cond_resched);
+   pr_info("Dynamic Preempt: %s\n", str);
+   } else {
+   pr_warn("Dynamic Preempt: Unsupported preempt mode %s, default 
to full\n", str);
+   return 1;
+   }
+   return 0;
+}
+__setup("preempt=", setup_preempt_mode);
+
+#endif /* CONFIG_PREEMPT_DYNAMIC */
+
+
 /*
  * This is the entry point to schedule() from kernel preemption
  * off of irq context.


[tip: sched/core] preempt/dynamic: Provide irqentry_exit_cond_resched() static call

2021-02-08 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 74345075999752a7a9c805fe5e2ec770345cd1ca
Gitweb:
https://git.kernel.org/tip/74345075999752a7a9c805fe5e2ec770345cd1ca
Author:Peter Zijlstra (Intel) 
AuthorDate:Mon, 18 Jan 2021 15:12:22 +01:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 05 Feb 2021 17:19:57 +01:00

preempt/dynamic: Provide irqentry_exit_cond_resched() static call

Provide static call to control IRQ preemption (called in CONFIG_PREEMPT)
so that we can override its behaviour when preempt= is overridden.

Since the default behaviour is full preemption, its call is
initialized to provide IRQ preemption when preempt= isn't passed.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210118141223.123667-8-frede...@kernel.org
---
 include/linux/entry-common.h |  4 
 kernel/entry/common.c| 10 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index ca86a00..1401c93 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_ENTRYCOMMON_H
 #define __LINUX_ENTRYCOMMON_H
 
+#include 
 #include 
 #include 
 #include 
@@ -453,6 +454,9 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs 
*regs);
  * Conditional reschedule with additional sanity checks.
  */
 void irqentry_exit_cond_resched(void);
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+#endif
 
 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 3783416..84fa7ec 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -393,6 +393,9 @@ void irqentry_exit_cond_resched(void)
preempt_schedule_irq();
}
 }
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+#endif
 
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
@@ -419,8 +422,13 @@ noinstr void irqentry_exit(struct pt_regs *regs, 
irqentry_state_t state)
}
 
instrumentation_begin();
-   if (IS_ENABLED(CONFIG_PREEMPTION))
+   if (IS_ENABLED(CONFIG_PREEMPTION)) {
+#ifdef CONFIG_PREEMPT_DYNAMIC
+   static_call(irqentry_exit_cond_resched)();
+#else
irqentry_exit_cond_resched();
+#endif
+   }
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();


[tip:perf/urgent] perf/x86: Fix embarrasing typo

2014-11-04 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  ce5686d4ed12158599d2042a6c8659254ed263ce
Gitweb: http://git.kernel.org/tip/ce5686d4ed12158599d2042a6c8659254ed263ce
Author: Peter Zijlstra (Intel) 
AuthorDate: Wed, 29 Oct 2014 11:17:04 +0100
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Nov 2014 07:06:58 +0100

perf/x86: Fix embarrasing typo

Because we're all human and typing sucks..

Fixes: 7fb0f1de49fc ("perf/x86: Fix compile warnings for intel_uncore")
Reported-by: Andi Kleen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: x...@kernel.org
Link: http://lkml.kernel.org/n/tip-be0bftjh8yfm4uvmvtf3y...@git.kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ded8a67..41a503c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,7 +144,7 @@ config INSTRUCTION_DECODER
 
 config PERF_EVENTS_INTEL_UNCORE
def_bool y
-   depends on PERF_EVENTS && SUP_SUP_INTEL && PCI
+   depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
 
 config OUTPUT_FORMAT
string


[tip:perf/urgent] perf: Fix bogus kernel printk

2014-10-28 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  65d71fe1375b973083733294795bf2b09d45b3c2
Gitweb: http://git.kernel.org/tip/65d71fe1375b973083733294795bf2b09d45b3c2
Author: Peter Zijlstra (Intel) 
AuthorDate: Tue, 7 Oct 2014 19:07:33 +0200
Committer:  Ingo Molnar 
CommitDate: Tue, 28 Oct 2014 10:51:01 +0100

perf: Fix bogus kernel printk

Andy spotted the fail in what was intended as a conditional printk level.

Reported-by: Andy Lutomirski 
Fixes: cc6cd47e7395 ("perf/x86: Tone down kernel messages when the PMU check 
fails in a virtual environment")
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/20141007124757.gh19...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/perf_event.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1b8299d..66451a6 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -243,8 +243,9 @@ static bool check_hw_exists(void)
 
 msr_fail:
printk(KERN_CONT "Broken PMU hardware detected, using software events 
only.\n");
-   printk(boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR
-  "Failed to access perfctr msr (MSR %x is %Lx)\n", reg, val_new);
+   printk("%sFailed to access perfctr msr (MSR %x is %Lx)\n",
+   boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR,
+   reg, val_new);
 
return false;
 }


[tip:smp/hotplug] sched: Allow per-cpu kernel threads to run on online && !active

2016-05-05 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  618d6e31623149c6203b46850e2e76ee0f29e577
Gitweb: http://git.kernel.org/tip/618d6e31623149c6203b46850e2e76ee0f29e577
Author: Peter Zijlstra (Intel) 
AuthorDate: Thu, 10 Mar 2016 12:54:08 +0100
Committer:  Thomas Gleixner 
CommitDate: Thu, 5 May 2016 13:17:52 +0200

sched: Allow per-cpu kernel threads to run on online && !active

In order to enable symmetric hotplug, we must mirror the online &&
!active state of cpu-down on the cpu-up side.

However, to retain sanity, limit this state to per-cpu kthreads.

Aside from the change to set_cpus_allowed_ptr(), which allows moving
the per-cpu kthreads on, the other critical piece is the cpu selection
for pinned tasks in select_task_rq(). This avoids dropping into
select_fallback_rq().

select_fallback_rq() cannot be allowed to select !active cpus because
it's used to migrate user tasks away. And we do not want to move user
tasks onto cpus that are in transition.
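
A hedged sketch of the mask selection with plain bit masks (PF_KTHREAD and
the mask values below are illustrative, not the kernel's cpumask API): per-cpu
kernel threads are validated against the online mask, everything else stays
restricted to active CPUs.

#include <stdio.h>

#define PF_KTHREAD 0x1

static unsigned int cpu_online_mask = 0x0f;  /* CPUs 0-3 are online */
static unsigned int cpu_active_mask = 0x07;  /* CPU 3 is online but not yet active */

static int set_cpus_allowed(unsigned int flags, unsigned int new_mask)
{
        unsigned int valid = (flags & PF_KTHREAD) ? cpu_online_mask
                                                  : cpu_active_mask;
        if (!(new_mask & valid))
                return -1;                   /* -EINVAL in the kernel */
        return 0;
}

int main(void)
{
        printf("kthread  -> CPU3: %d\n", set_cpus_allowed(PF_KTHREAD, 0x8)); /* allowed */
        printf("usertask -> CPU3: %d\n", set_cpus_allowed(0, 0x8));          /* rejected */
        return 0;
}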

Requested-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Thomas Gleixner 
Cc: Lai Jiangshan 
Cc: Jan H. Schönherr 
Cc: Oleg Nesterov 
Cc: r...@linutronix.de
Link: 
http://lkml.kernel.org/r/20160301152303.gv6...@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner 

---
 arch/powerpc/kernel/smp.c |  2 +-
 arch/s390/kernel/smp.c|  2 +-
 include/linux/cpumask.h   |  6 ++
 kernel/sched/core.c   | 49 ---
 4 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8cac1eb..55c924b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -565,7 +565,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
smp_ops->give_timebase();
 
/* Wait until cpu puts itself in the online & active maps */
-   while (!cpu_online(cpu) || !cpu_active(cpu))
+   while (!cpu_online(cpu))
cpu_relax();
 
return 0;
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 40a6b4f..7b89a75 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -832,7 +832,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
pcpu_attach_task(pcpu, tidle);
pcpu_start_fn(pcpu, smp_start_secondary, NULL);
/* Wait until cpu puts itself in the online & active maps */
-   while (!cpu_online(cpu) || !cpu_active(cpu))
+   while (!cpu_online(cpu))
cpu_relax();
return 0;
 }
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 40cee6b..e828cf6 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -743,12 +743,10 @@ set_cpu_present(unsigned int cpu, bool present)
 static inline void
 set_cpu_online(unsigned int cpu, bool online)
 {
-   if (online) {
+   if (online)
cpumask_set_cpu(cpu, &__cpu_online_mask);
-   cpumask_set_cpu(cpu, &__cpu_active_mask);
-   } else {
+   else
cpumask_clear_cpu(cpu, &__cpu_online_mask);
-   }
 }
 
 static inline void
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..8bfd7d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1082,13 +1082,21 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
 static int __set_cpus_allowed_ptr(struct task_struct *p,
  const struct cpumask *new_mask, bool check)
 {
+   const struct cpumask *cpu_valid_mask = cpu_active_mask;
+   unsigned int dest_cpu;
unsigned long flags;
struct rq *rq;
-   unsigned int dest_cpu;
int ret = 0;
 
rq = task_rq_lock(p, &flags);
 
+   if (p->flags & PF_KTHREAD) {
+   /*
+* Kernel threads are allowed on online && !active CPUs
+*/
+   cpu_valid_mask = cpu_online_mask;
+   }
+
/*
 * Must re-check here, to close a race against __kthread_bind(),
 * sched_setaffinity() is not guaranteed to observe the flag.
@@ -1101,18 +1109,28 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
if (cpumask_equal(&p->cpus_allowed, new_mask))
goto out;
 
-   if (!cpumask_intersects(new_mask, cpu_active_mask)) {
+   if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
ret = -EINVAL;
goto out;
}
 
do_set_cpus_allowed(p, new_mask);
 
+   if (p->flags & PF_KTHREAD) {
+   /*
+* For kernel threads that do indeed end up on online &&
+* !active we want to ensure they are strict per-cpu threads.
+*/
+   WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
+   !cpumask_intersects(new_mask, cpu_active_mask) &&
+   p->nr_cpus_allowed != 1);
+   }
+
 

[tip:smp/hotplug] sched: Allow per-cpu kernel threads to run on online && !active

2016-05-06 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  e9d867a67fd03ccc07248ca4e9c2f74fed494d5b
Gitweb: http://git.kernel.org/tip/e9d867a67fd03ccc07248ca4e9c2f74fed494d5b
Author: Peter Zijlstra (Intel) 
AuthorDate: Thu, 10 Mar 2016 12:54:08 +0100
Committer:  Thomas Gleixner 
CommitDate: Fri, 6 May 2016 14:58:22 +0200

sched: Allow per-cpu kernel threads to run on online && !active

In order to enable symmetric hotplug, we must mirror the online &&
!active state of cpu-down on the cpu-up side.

However, to retain sanity, limit this state to per-cpu kthreads.

Aside from the change to set_cpus_allowed_ptr(), which allows moving
the per-cpu kthreads on, the other critical piece is the cpu selection
for pinned tasks in select_task_rq(). This avoids dropping into
select_fallback_rq().

select_fallback_rq() cannot be allowed to select !active cpus because
it's used to migrate user tasks away. And we do not want to move user
tasks onto cpus that are in transition.

Requested-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Thomas Gleixner 
Cc: Lai Jiangshan 
Cc: Jan H. Schönherr 
Cc: Oleg Nesterov 
Cc: r...@linutronix.de
Link: 
http://lkml.kernel.org/r/20160301152303.gv6...@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner 
---
 arch/powerpc/kernel/smp.c |  2 +-
 arch/s390/kernel/smp.c|  2 +-
 include/linux/cpumask.h   |  6 ++
 kernel/sched/core.c   | 49 ---
 4 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8cac1eb..55c924b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -565,7 +565,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
smp_ops->give_timebase();
 
/* Wait until cpu puts itself in the online & active maps */
-   while (!cpu_online(cpu) || !cpu_active(cpu))
+   while (!cpu_online(cpu))
cpu_relax();
 
return 0;
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 40a6b4f..7b89a75 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -832,7 +832,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
pcpu_attach_task(pcpu, tidle);
pcpu_start_fn(pcpu, smp_start_secondary, NULL);
/* Wait until cpu puts itself in the online & active maps */
-   while (!cpu_online(cpu) || !cpu_active(cpu))
+   while (!cpu_online(cpu))
cpu_relax();
return 0;
 }
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 40cee6b..e828cf6 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -743,12 +743,10 @@ set_cpu_present(unsigned int cpu, bool present)
 static inline void
 set_cpu_online(unsigned int cpu, bool online)
 {
-   if (online) {
+   if (online)
cpumask_set_cpu(cpu, &__cpu_online_mask);
-   cpumask_set_cpu(cpu, &__cpu_active_mask);
-   } else {
+   else
cpumask_clear_cpu(cpu, &__cpu_online_mask);
-   }
 }
 
 static inline void
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..8bfd7d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1082,13 +1082,21 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
 static int __set_cpus_allowed_ptr(struct task_struct *p,
  const struct cpumask *new_mask, bool check)
 {
+   const struct cpumask *cpu_valid_mask = cpu_active_mask;
+   unsigned int dest_cpu;
unsigned long flags;
struct rq *rq;
-   unsigned int dest_cpu;
int ret = 0;
 
rq = task_rq_lock(p, &flags);
 
+   if (p->flags & PF_KTHREAD) {
+   /*
+* Kernel threads are allowed on online && !active CPUs
+*/
+   cpu_valid_mask = cpu_online_mask;
+   }
+
/*
 * Must re-check here, to close a race against __kthread_bind(),
 * sched_setaffinity() is not guaranteed to observe the flag.
@@ -1101,18 +1109,28 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
if (cpumask_equal(&p->cpus_allowed, new_mask))
goto out;
 
-   if (!cpumask_intersects(new_mask, cpu_active_mask)) {
+   if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
ret = -EINVAL;
goto out;
}
 
do_set_cpus_allowed(p, new_mask);
 
+   if (p->flags & PF_KTHREAD) {
+   /*
+* For kernel threads that do indeed end up on online &&
+* !active we want to ensure they are strict per-cpu threads.
+*/
+   WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
+   !cpumask_intersects(new_mask, cpu_active_mask) &&
+   p->nr_cpus_allowed != 1);
+   }
+
 

[tip:perf/core] perf/core: Rename perf_event_read_{one,group}, perf_read_hw

2015-09-13 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  b15f495b4e9295cf21065d8569835a2f18cfe41b
Gitweb: http://git.kernel.org/tip/b15f495b4e9295cf21065d8569835a2f18cfe41b
Author: Peter Zijlstra (Intel) 
AuthorDate: Thu, 3 Sep 2015 20:07:47 -0700
Committer:  Ingo Molnar 
CommitDate: Sun, 13 Sep 2015 11:27:26 +0200

perf/core: Rename perf_event_read_{one,group}, perf_read_hw

In order to free up the perf_event_read_group() name:

 s/perf_event_read_\(one\|group\)/perf_read_\1/g
 s/perf_read_hw/__perf_read/g

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Michael Ellerman 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Link: 
http://lkml.kernel.org/r/1441336073-22750-5-git-send-email-suka...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/events/core.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 260bf8c..67b7dba 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3742,7 +3742,7 @@ static void put_event(struct perf_event *event)
 * see the comment there.
 *
 *  2) there is a lock-inversion with mmap_sem through
-* perf_event_read_group(), which takes faults while
+* perf_read_group(), which takes faults while
 * holding ctx->mutex, however this is called after
 * the last filedesc died, so there is no possibility
 * to trigger the AB-BA case.
@@ -3837,7 +3837,7 @@ u64 perf_event_read_value(struct perf_event *event, u64 
*enabled, u64 *running)
 }
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
-static int perf_event_read_group(struct perf_event *event,
+static int perf_read_group(struct perf_event *event,
   u64 read_format, char __user *buf)
 {
struct perf_event *leader = event->group_leader, *sub;
@@ -3885,7 +3885,7 @@ static int perf_event_read_group(struct perf_event *event,
return ret;
 }
 
-static int perf_event_read_one(struct perf_event *event,
+static int perf_read_one(struct perf_event *event,
 u64 read_format, char __user *buf)
 {
u64 enabled, running;
@@ -3923,7 +3923,7 @@ static bool is_event_hup(struct perf_event *event)
  * Read the performance event - simple non blocking version for now
  */
 static ssize_t
-perf_read_hw(struct perf_event *event, char __user *buf, size_t count)
+__perf_read(struct perf_event *event, char __user *buf, size_t count)
 {
u64 read_format = event->attr.read_format;
int ret;
@@ -3941,9 +3941,9 @@ perf_read_hw(struct perf_event *event, char __user *buf, 
size_t count)
 
WARN_ON_ONCE(event->ctx->parent_ctx);
if (read_format & PERF_FORMAT_GROUP)
-   ret = perf_event_read_group(event, read_format, buf);
+   ret = perf_read_group(event, read_format, buf);
else
-   ret = perf_event_read_one(event, read_format, buf);
+   ret = perf_read_one(event, read_format, buf);
 
return ret;
 }
@@ -3956,7 +3956,7 @@ perf_read(struct file *file, char __user *buf, size_t 
count, loff_t *ppos)
int ret;
 
ctx = perf_event_ctx_lock(event);
-   ret = perf_read_hw(event, buf, count);
+   ret = __perf_read(event, buf, count);
perf_event_ctx_unlock(event, ctx);
 
return ret;


[tip:sched/core] wait.[ch]: Introduce the simple waitqueue (swait) implementation

2016-02-25 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  13b35686e8b934ff78f59cef0c65fa3a43f8eeaf
Gitweb: http://git.kernel.org/tip/13b35686e8b934ff78f59cef0c65fa3a43f8eeaf
Author: Peter Zijlstra (Intel) 
AuthorDate: Fri, 19 Feb 2016 09:46:37 +0100
Committer:  Thomas Gleixner 
CommitDate: Thu, 25 Feb 2016 11:27:16 +0100

wait.[ch]: Introduce the simple waitqueue (swait) implementation

The existing wait queue code supports custom wake-up callbacks, wake
flags, a wake key (passed to the callback) and exclusive flags that
allow waiters to be tagged as exclusive, to limit the number of
waiters woken up.

In a lot of cases none of these features are used, and hence we can
benefit from a slimmed-down version that lowers both memory and
runtime overhead.

The concept originated from -rt, where waitqueues are a constant
source of trouble, as we can't convert the head lock to a raw
spinlock due to fancy and long lasting callbacks.

With the removal of custom callbacks, we can use a raw lock for
queue list manipulations, hence allowing the simple wait support
to be used in -rt.

[The patch is from PeterZ and is based on Thomas' version. The commit message
 was written by Paul G.
 Daniel:  - Fixed some compile issues
          - Added a non-lazy implementation of swake_up_locked as suggested
            by Boqun Feng.]
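
A minimal usage sketch of the resulting API may help; note that the
swait_event()/swake_up() helpers are assumed from the rest of this
series (only the head/queue types and the DECLARE_*/init_* macros are
visible in the hunk quoted below), so treat this as illustrative only:

	static DECLARE_SWAIT_QUEUE_HEAD(my_wq);
	static bool my_cond;

	static void waiter(void)
	{
		/* Sleeps TASK_UNINTERRUPTIBLE until my_cond is true. */
		swait_event(my_wq, my_cond);
	}

	static void waker(void)
	{
		my_cond = true;		/* real users order this properly */
		swake_up(&my_wq);	/* wakes a single waiter, TASK_NORMAL */
	}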

Originally-by: Thomas Gleixner 
Signed-off-by: Daniel Wagner 
Acked-by: Peter Zijlstra (Intel) 
Cc: linux-rt-us...@vger.kernel.org
Cc: Boqun Feng 
Cc: Marcelo Tosatti 
Cc: Steven Rostedt 
Cc: Paul Gortmaker 
Cc: Paolo Bonzini 
Cc: "Paul E. McKenney" 
Link: http://lkml.kernel.org/r/1455871601-27484-2-git-send-email-w...@monom.org
Signed-off-by: Thomas Gleixner 
---
 include/linux/swait.h | 172 ++
 kernel/sched/Makefile |   2 +-
 kernel/sched/swait.c  | 123 
 3 files changed, 296 insertions(+), 1 deletion(-)

diff --git a/include/linux/swait.h b/include/linux/swait.h
new file mode 100644
index 000..c1f9c62
--- /dev/null
+++ b/include/linux/swait.h
@@ -0,0 +1,172 @@
+#ifndef _LINUX_SWAIT_H
+#define _LINUX_SWAIT_H
+
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Simple wait queues
+ *
+ * While these are very similar to the other/complex wait queues (wait.h) the
+ * most important difference is that the simple waitqueue allows for
+ * deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold
+ * times.
+ *
+ * In order to make this so, we had to drop a fair number of features of the
+ * other waitqueue code; notably:
+ *
+ *  - mixing INTERRUPTIBLE and UNINTERRUPTIBLE sleeps on the same waitqueue;
+ *all wakeups are TASK_NORMAL in order to avoid O(n) lookups for the right
+ *sleeper state.
+ *
+ *  - the exclusive mode; because this requires preserving the list order
+ *and this is hard.
+ *
+ *  - custom wake functions; because you cannot give any guarantees about
+ *random code.
+ *
+ * As a side effect of this; the data structures are slimmer.
+ *
+ * One would recommend using this wait queue where possible.
+ */
+
+struct task_struct;
+
+struct swait_queue_head {
+   raw_spinlock_t  lock;
+   struct list_headtask_list;
+};
+
+struct swait_queue {
+   struct task_struct  *task;
+   struct list_headtask_list;
+};
+
+#define __SWAITQUEUE_INITIALIZER(name) {   \
+   .task   = current,  \
+   .task_list  = LIST_HEAD_INIT((name).task_list), \
+}
+
+#define DECLARE_SWAITQUEUE(name)   \
+   struct swait_queue name = __SWAITQUEUE_INITIALIZER(name)
+
+#define __SWAIT_QUEUE_HEAD_INITIALIZER(name) { \
+   .lock   = __RAW_SPIN_LOCK_UNLOCKED(name.lock),  \
+   .task_list  = LIST_HEAD_INIT((name).task_list), \
+}
+
+#define DECLARE_SWAIT_QUEUE_HEAD(name) \
+   struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INITIALIZER(name)
+
+extern void __init_swait_queue_head(struct swait_queue_head *q, const char 
*name,
+   struct lock_class_key *key);
+
+#define init_swait_queue_head(q)   \
+   do {\
+   static struct lock_class_key __key; \
+   __init_swait_queue_head((q), #q, &__key);   \
+   } while (0)
+
+#ifdef CONFIG_LOCKDEP
+# define __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
+   ({ init_swait_queue_head(&name); name; })
+# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\
+   struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name)
+#else
+# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\
+   DECLARE_SWAIT_QUEUE_HEAD(name)
+#endif
+
+static inline int swait_active(struct swait_queue_hea

[tip:locking/core] locking/qspinlock: Add pending bit

2015-05-08 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7
Gitweb: http://git.kernel.org/tip/c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7
Author: Peter Zijlstra (Intel) 
AuthorDate: Fri, 24 Apr 2015 14:56:32 -0400
Committer:  Ingo Molnar 
CommitDate: Fri, 8 May 2015 12:36:32 +0200

locking/qspinlock: Add pending bit

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
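
For readers not following the full diff (the slowpath hunk is truncated
below), here is a condensed, hedged sketch of the pending-bit
acquisition this patch adds; it is simplified, not the exact kernel
code, and returns false where the real code falls back to the MCS queue:

	static bool pending_bit_sketch(struct qspinlock *lock)
	{
		u32 old, new, val = atomic_read(&lock->val);

		/* Pending set, lock released: handover is imminent, so wait. */
		if (val == _Q_PENDING_VAL) {
			while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
				cpu_relax();
		}

		/* Anything beyond a plain lock holder: caller must queue. */
		if (val & ~_Q_LOCKED_MASK)
			return false;

		/* Trylock, or become the single in-word (pending) spinner. */
		new = val ? (val | _Q_PENDING_VAL) : _Q_LOCKED_VAL;
		old = atomic_cmpxchg(&lock->val, val, new);
		if (old != val)
			return false;	/* simplified; the real code retries */

		if (new == _Q_LOCKED_VAL)
			return true;	/* trylock won */

		/* Wait for the current owner to release the lock byte... */
		while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
			cpu_relax();

		/* ...then take the lock and clear pending: *,1,0 -> *,0,1 */
		for (;;) {
			old = atomic_cmpxchg(&lock->val, val,
					(val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL);
			if (old == val)
				return true;
			val = old;
		}
	}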

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Daniel J Blueman 
Cc: David Vrabel 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Konrad Rzeszutek Wilk 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Paolo Bonzini 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Raghavendra K T 
Cc: Rik van Riel 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Link: 
http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-waiman.l...@hp.com
Signed-off-by: Ingo Molnar 
---
 include/asm-generic/qspinlock_types.h |  12 +++-
 kernel/locking/qspinlock.c| 119 --
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index aec05c7..7ee6632 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define_Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
  << _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS1
+#define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 029b51c..af9c2ef 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
return per_cpu_ptr(&mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
  * @lock: Pointer to queued spinlock structure
  * @val: Current value of the queued spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
- *  fast  :slow  :unlock
- *:  :
- * uncontended  (0,0)   --:--> (0,1) :--> (*,0)
- *:   | ^./  :
- *:   v   \   |  :
- * uncontended:(n,x) --+--> (n,0) |  :
- *   queue:   | ^--'  |  :
- *:   v   |  :
- * contended  :(*,x) --+--> (*,0) -> (*,1) ---'  :
- *   queue: ^--' :
+ * (queue tail, pending bit, lock value)
  *
+ *  fast :slow  :unlock
+ *   :  :
+ * uncontended  (0,0,0) -:--> (0,0,1) --:--> (*,*,0)
+ *   :   | ^.--. /  :
+ *   :   v   \  \|  :
+ * pending   :(0,1,1) +--> (0,1,0)   \   |  :
+ *   :  

[tip:locking/core] locking/qspinlock: Optimize for smaller NR_CPUS

2015-05-08 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  69f9cae90907e09af95fb991ed384670cef8dd32
Gitweb: http://git.kernel.org/tip/69f9cae90907e09af95fb991ed384670cef8dd32
Author: Peter Zijlstra (Intel) 
AuthorDate: Fri, 24 Apr 2015 14:56:34 -0400
Committer:  Ingo Molnar 
CommitDate: Fri, 8 May 2015 12:36:48 +0200

locking/qspinlock: Optimize for smaller NR_CPUS

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16 bits.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).
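
To make the win concrete: with _Q_PENDING_BITS == 8 the tail occupies
its own naturally aligned 16-bit half-word, so enqueueing can be a
single xchg16 instead of a cmpxchg loop. A hedged sketch of such a
helper, assuming the __qspinlock overlay added below (little-endian
layout shown; this is an illustration, not necessarily the exact
truncated hunk):

	static inline u32 xchg_tail_sketch(struct qspinlock *lock, u32 tail)
	{
		struct __qspinlock *l = (void *)lock;

		/* Swap only the tail half-word; locked+pending are untouched. */
		return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
	}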

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Daniel J Blueman 
Cc: David Vrabel 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Konrad Rzeszutek Wilk 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Paolo Bonzini 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Raghavendra K T 
Cc: Rik van Riel 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Link: 
http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-waiman.l...@hp.com
Signed-off-by: Ingo Molnar 
---
 include/asm-generic/qspinlock_types.h | 13 +++
 kernel/locking/qspinlock.c| 69 ++-
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 3a7f671..85f888e 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 82bb4a9..e17efe7 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -56,6 +57,10 @@
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
  * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
+ *
  */
 
 #include "mcs_spinlock.h"
@@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_ta

[tip:locking/core] locking/qspinlock: Revert to test-and-set on hypervisors

2015-05-08 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  2aa79af64263190eec610422b07f60e99a7d230a
Gitweb: http://git.kernel.org/tip/2aa79af64263190eec610422b07f60e99a7d230a
Author: Peter Zijlstra (Intel) 
AuthorDate: Fri, 24 Apr 2015 14:56:36 -0400
Committer:  Ingo Molnar 
CommitDate: Fri, 8 May 2015 12:36:58 +0200

locking/qspinlock: Revert to test-and-set on hypervisors

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Daniel J Blueman 
Cc: David Vrabel 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Konrad Rzeszutek Wilk 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Paolo Bonzini 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Raghavendra K T 
Cc: Rik van Riel 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Link: 
http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-waiman.l...@hp.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/qspinlock.h | 14 ++
 include/asm-generic/qspinlock.h  |  7 +++
 kernel/locking/qspinlock.c   |  3 +++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index e2aee82..f079b70 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include 
 #include 
 
 #definequeued_spin_unlock queued_spin_unlock
@@ -15,6 +16,19 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
smp_store_release((u8 *)lock, 0);
 }
 
+#define virt_queued_spin_lock virt_queued_spin_lock
+
+static inline bool virt_queued_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include 
 
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 569abcd..83bfb87 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -111,6 +111,13 @@ static inline void queued_spin_unlock_wait(struct 
qspinlock *lock)
cpu_relax();
 }
 
+#ifndef virt_queued_spin_lock
+static __always_inline bool virt_queued_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0338721..fd31a47 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -249,6 +249,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   if (virt_queued_spin_lock(lock))
+   return;
+
/*
 * wait for in-progress pending->locked hand-overs
 *


[tip:locking/core] locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching

2015-05-08 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9
Gitweb: http://git.kernel.org/tip/f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9
Author: Peter Zijlstra (Intel) 
AuthorDate: Fri, 24 Apr 2015 14:56:38 -0400
Committer:  Ingo Molnar 
CommitDate: Fri, 8 May 2015 12:37:09 +0200

locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching

We use the regular paravirt call patching to switch between:

  native_queued_spin_lock_slowpath()__pv_queued_spin_lock_slowpath()
  native_queued_spin_unlock()   __pv_queued_spin_unlock()

We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.

We further optimize the unlock path by patching the direct call with a
"movb $0,%arg1" if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.

This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Daniel J Blueman 
Cc: David Vrabel 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Konrad Rzeszutek Wilk 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Paolo Bonzini 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Raghavendra K T 
Cc: Rik van Riel 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Link: 
http://lkml.kernel.org/r/1429901803-29771-10-git-send-email-waiman.l...@hp.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/Kconfig  |  2 +-
 arch/x86/include/asm/paravirt.h   | 29 -
 arch/x86/include/asm/paravirt_types.h | 10 ++
 arch/x86/include/asm/qspinlock.h  | 25 -
 arch/x86/include/asm/qspinlock_paravirt.h |  6 ++
 arch/x86/kernel/paravirt-spinlocks.c  | 24 +++-
 arch/x86/kernel/paravirt_patch_32.c   | 22 ++
 arch/x86/kernel/paravirt_patch_64.c   | 22 ++
 8 files changed, 128 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 90b1b54..50ec043 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -667,7 +667,7 @@ config PARAVIRT_DEBUG
 config PARAVIRT_SPINLOCKS
bool "Paravirtualization layer for spinlocks"
depends on PARAVIRT && SMP
-   select UNINLINE_SPIN_UNLOCK
+   select UNINLINE_SPIN_UNLOCK if !QUEUED_SPINLOCK
---help---
  Paravirtualized spinlocks allow a pvops backend to replace the
  spinlock implementation with something virtualization-friendly
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 8957810..266c353 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,31 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUED_SPINLOCK
+
+static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock 
*lock,
+   u32 val)
+{
+   PVOP_VCALL2(pv_lock_ops.queued_spin_lock_slowpath, lock, val);
+}
+
+static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock)
+{
+   PVOP_VCALLEE1(pv_lock_ops.queued_spin_unlock, lock);
+}
+
+static __always_inline void pv_wait(u8 *ptr, u8 val)
+{
+   PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+   PVOP_VCALL1(pv_lock_ops.kick, cpu);
+}
+
+#else /* !CONFIG_QUEUED_SPINLOCK */
+
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -724,7 +749,9 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
-#endif
+#endif /* CONFIG_QUEUED_SPINLOCK */
+
+#endif /* SMP && PARAVIRT_SPINLOCKS */
 
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;"
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index f7b0b5c..76cd684 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,19 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+struct qspinlock;
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUED_SPINLOCK
+   void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+   struct paravirt_callee_save queued_spin_unlock;
+
+   void (*wait)(u8 *ptr, u8 val);
+   void (*kick)(int cpu);
+#else /* !CONFIG_QUEUED_SPINLOCK */
struct paravirt_callee_save lock_spinning;
v

[tip:perf/core] perf: Fix move_group() order

2015-02-04 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  8f95b435b62522aed3381aaea920de8d09ccabf3
Gitweb: http://git.kernel.org/tip/8f95b435b62522aed3381aaea920de8d09ccabf3
Author: Peter Zijlstra (Intel) 
AuthorDate: Tue, 27 Jan 2015 11:53:12 +0100
Committer:  Ingo Molnar 
CommitDate: Wed, 4 Feb 2015 08:07:11 +0100

perf: Fix move_group() order

Jiri reported triggering the new WARN_ON_ONCE in event_sched_out over
the weekend:

  event_sched_out.isra.79+0x2b9/0x2d0
  group_sched_out+0x69/0xc0
  ctx_sched_out+0x106/0x130
  task_ctx_sched_out+0x37/0x70
  __perf_install_in_context+0x70/0x1a0
  remote_function+0x48/0x60
  generic_exec_single+0x15b/0x1d0
  smp_call_function_single+0x67/0xa0
  task_function_call+0x53/0x80
  perf_install_in_context+0x8b/0x110

I think the below should cure this; if we install a group leader it
will iterate the (still intact) group list, find its siblings and
try to install those too -- even though those still have the old
event->ctx -- in the new ctx.

Upon installing the first group sibling we'd try to schedule out the
group and trigger the above warn.

Fix this by installing the group leader last; installing the siblings
first has no effect, since they're not reachable through the group
lists and therefore we don't schedule them.

Also delay resetting the state until we're absolutely sure the events
are quiescent.

Reported-by: Jiri Olsa 
Reported-by: vincent.wea...@maine.edu
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/20150126162639.ga21...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
---
 kernel/events/core.c | 56 +++-
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 417a96b..142dbabc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7645,16 +7645,9 @@ SYSCALL_DEFINE5(perf_event_open,
 
perf_remove_from_context(group_leader, false);
 
-   /*
-* Removing from the context ends up with disabled
-* event. What we want here is event in the initial
-* startup state, ready to be add into new context.
-*/
-   perf_event__state_init(group_leader);
list_for_each_entry(sibling, &group_leader->sibling_list,
group_entry) {
perf_remove_from_context(sibling, false);
-   perf_event__state_init(sibling);
put_ctx(gctx);
}
} else {
@@ -7670,13 +7663,31 @@ SYSCALL_DEFINE5(perf_event_open,
 */
synchronize_rcu();
 
-   perf_install_in_context(ctx, group_leader, group_leader->cpu);
-   get_ctx(ctx);
+   /*
+* Install the group siblings before the group leader.
+*
+* Because a group leader will try and install the entire group
+* (through the sibling list, which is still in-tact), we can
+* end up with siblings installed in the wrong context.
+*
+* By installing siblings first we NO-OP because they're not
+* reachable through the group lists.
+*/
list_for_each_entry(sibling, &group_leader->sibling_list,
group_entry) {
+   perf_event__state_init(sibling);
perf_install_in_context(ctx, sibling, sibling->cpu);
get_ctx(ctx);
}
+
+   /*
+* Removing from the context ends up with disabled
+* event. What we want here is event in the initial
+* startup state, ready to be add into new context.
+*/
+   perf_event__state_init(group_leader);
+   perf_install_in_context(ctx, group_leader, group_leader->cpu);
+   get_ctx(ctx);
}
 
perf_install_in_context(ctx, event, event->cpu);
@@ -7806,8 +7817,35 @@ void perf_pmu_migrate_context(struct pmu *pmu, int 
src_cpu, int dst_cpu)
list_add(&event->migrate_entry, &events);
}
 
+   /*
+* Wait for the events to quiesce before re-instating them.
+*/
synchronize_rcu();
 
+   /*
+* Re-instate events in 2 passes.
+*
+* Skip over group leaders and only install siblings on this first
+* pass, siblings will not get enabled without a leader, however a
+* leader will enable its siblings, even if those are still on the old
+* context.
+*/
+   list_for_each_entry_safe(event, tmp, &events, migrate_entry) {
+   if (event->group_leader == event)
+   continue;
+
+

[tip:x86/boot] x86/kaslr, ACPI/NUMA: Fix KASLR build error

2018-10-03 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  3a387c6d96e69f1710a3804eb68e1253263298f2
Gitweb: https://git.kernel.org/tip/3a387c6d96e69f1710a3804eb68e1253263298f2
Author: Peter Zijlstra (Intel) 
AuthorDate: Wed, 3 Oct 2018 14:41:27 +0200
Committer:  Borislav Petkov 
CommitDate: Wed, 3 Oct 2018 16:15:49 +0200

x86/kaslr, ACPI/NUMA: Fix KASLR build error

There is no point in trying to compile KASLR-specific code when there is
no KASLR.

 [ bp: Move the whole crap into kaslr.c and make
   rand_mem_physical_padding static. ]

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Borislav Petkov 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Link: 
http://lkml.kernel.org/r/20181003123402.ga15...@hirez.programming.kicks-ass.net
---
 arch/x86/include/asm/kaslr.h |  2 ++
 arch/x86/include/asm/setup.h |  2 --
 arch/x86/mm/kaslr.c  | 19 ++-
 drivers/acpi/numa.c  | 15 +++
 4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h
index db7ba2feb947..95ef3fc01d12 100644
--- a/arch/x86/include/asm/kaslr.h
+++ b/arch/x86/include/asm/kaslr.h
@@ -6,8 +6,10 @@ unsigned long kaslr_get_random_long(const char *purpose);
 
 #ifdef CONFIG_RANDOMIZE_MEMORY
 void kernel_randomize_memory(void);
+void kaslr_check_padding(void);
 #else
 static inline void kernel_randomize_memory(void) { }
+static inline void kaslr_check_padding(void) { }
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 
 #endif
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 65a5bf8f6aba..ae13bc974416 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -80,8 +80,6 @@ static inline unsigned long kaslr_offset(void)
return (unsigned long)&_text - __START_KERNEL;
 }
 
-extern int rand_mem_physical_padding;
-
 /*
  * Do NOT EVER look at the BIOS memory size location.
  * It does not work on many machines.
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 00cf4cae38f5..b3471388288d 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,7 +41,7 @@
  */
 static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;
 
-int __initdata rand_mem_physical_padding = 
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
+static int __initdata rand_mem_physical_padding = 
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
  * earlier during boot). The list is ordered based on virtual addresses. This
@@ -70,6 +71,22 @@ static inline bool kaslr_memory_enabled(void)
return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
 }
 
+/*
+ * Check the padding size for KASLR is enough.
+ */
+void __init kaslr_check_padding(void)
+{
+   u64 max_possible_phys, max_actual_phys, threshold;
+
+   max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40);
+   max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40);
+   threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40);
+
+   if (max_possible_phys > threshold)
+   pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory 
hotadd failure.\n",
+   (max_possible_phys - max_actual_phys) >> 40);
+}
+
 static int __init rand_mem_physical_padding_setup(char *str)
 {
int max_padding = (1 << (MAX_PHYSMEM_BITS - TB_SHIFT)) - 1;
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 3d69834c692f..4408e37600ef 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -32,7 +32,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 static nodemask_t nodes_found_map = NODE_MASK_NONE;
 
@@ -436,7 +436,6 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 int __init acpi_numa_init(void)
 {
int cnt = 0;
-   u64 max_possible_phys, max_actual_phys, threshold;
 
if (acpi_disabled)
return -EINVAL;
@@ -466,17 +465,9 @@ int __init acpi_numa_init(void)
cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
acpi_parse_memory_affinity, 0);
 
-   /* check the padding size for KASLR is enough. */
-   if (parsed_numa_memblks && kaslr_enabled()) {
-   max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 
40);
-   max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 
1ULL << 40);
-   threshold = max_actual_phys + 
((u64)rand_mem_physical_padding << 40);
+   if (parsed_numa_memblks)
+   kaslr_check_padding();
 
-   if (max_possible_phys > threshold) {
-   pr_warn("Set 'rand_mem_physical_padding=%llu' 
to avoid memory hotadd failure.\n",
- (max_possible_phys -

[tip:perf/core] perf annotate: Simplify header dotted line sizing

2016-06-30 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  53dd9b5f95dda95bcadda1b4680be42dfe1f9e5e
Gitweb: http://git.kernel.org/tip/53dd9b5f95dda95bcadda1b4680be42dfe1f9e5e
Author: Peter Zijlstra (Intel) 
AuthorDate: Thu, 30 Jun 2016 09:17:26 -0300
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Thu, 30 Jun 2016 09:21:03 -0300

perf annotate: Simplify header dotted line sizing

No need to use strlen(), etc. to figure that out; just use the return
value from printf(), which tells us how wide the following line needs
to be.
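
The whole trick is that printf() returns the number of characters it
wrote. A stand-alone illustration (plain userspace C, not taken from
the patch):

	#include <stdio.h>

	int main(void)
	{
		/* printf() returns how many characters it just printed... */
		int len = printf(" %-8.8s| Source code & Disassembly of %s for %s\n",
				 "Percent", "vmlinux", "cycles");

		/* ...which is exactly how wide the dotted line must be. */
		printf("%-*.*s\n", len, len,
		       "--------------------------------------------------"
		       "--------------------------------------------------");
		return 0;
	}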

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Jiri Olsa 
Link: 
http://lkml.kernel.org/r/20160630082955.ga30...@twins.programming.kicks-ass.net
[ split from a larger patch ]
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/annotate.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index c385fec..78e5d6f 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1528,7 +1528,7 @@ int symbol__annotate_printf(struct symbol *sym, struct 
map *map,
int more = 0;
u64 len;
int width = 8;
-   int namelen, evsel_name_len, graph_dotted_len;
+   int graph_dotted_len;
 
filename = strdup(dso->long_name);
if (!filename)
@@ -1540,17 +1540,14 @@ int symbol__annotate_printf(struct symbol *sym, struct 
map *map,
d_filename = basename(filename);
 
len = symbol__size(sym);
-   namelen = strlen(d_filename);
-   evsel_name_len = strlen(evsel_name);
 
if (perf_evsel__is_group_event(evsel))
width *= evsel->nr_members;
 
-   printf(" %-*.*s|Source code & Disassembly of %s for %s\n",
+   graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s 
for %s\n",
   width, width, "Percent", d_filename, evsel_name);
 
-   graph_dotted_len = width + namelen + evsel_name_len;
-   printf("-%-*.*s-\n",
+   printf("%-*.*s\n",
   graph_dotted_len, graph_dotted_len, graph_dotted_line);
 
if (verbose)


[tip:perf/core] perf annotate: Add number of samples to the header

2016-06-30 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  135cce1bf12bd30d7d66360022f9dac6ea3a07cd
Gitweb: http://git.kernel.org/tip/135cce1bf12bd30d7d66360022f9dac6ea3a07cd
Author: Peter Zijlstra (Intel) 
AuthorDate: Thu, 30 Jun 2016 10:29:55 +0200
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Thu, 30 Jun 2016 18:27:42 -0300

perf annotate: Add number of samples to the header

Staring at annotations of large functions is useless if there are only
a few samples in them. Report the number of samples in the header to
make this easier to determine.

Committer note:

The change amounts to:

  - Percent | Source code & Disassembly of perf-vdso.so for cycles:u
  --
  + Percent | Source code & Disassembly of perf-vdso.so for cycles:u (3278 samples)
  
+

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Jiri Olsa 
Link: 
http://lkml.kernel.org/r/20160630082955.ga30...@twins.programming.kicks-ass.net
[ split from a larger patch ]
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/annotate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 78e5d6f..e9825fe 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1522,6 +1522,7 @@ int symbol__annotate_printf(struct symbol *sym, struct 
map *map,
const char *d_filename;
const char *evsel_name = perf_evsel__name(evsel);
struct annotation *notes = symbol__annotation(sym);
+   struct sym_hist *h = annotation__histogram(notes, evsel->idx);
struct disasm_line *pos, *queue = NULL;
u64 start = map__rip_2objdump(map, sym->start);
int printed = 2, queue_len = 0;
@@ -1544,8 +1545,8 @@ int symbol__annotate_printf(struct symbol *sym, struct 
map *map,
if (perf_evsel__is_group_event(evsel))
width *= evsel->nr_members;
 
-   graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s 
for %s\n",
-  width, width, "Percent", d_filename, evsel_name);
+   graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s 
for %s (%" PRIu64 " samples)\n",
+  width, width, "Percent", d_filename, evsel_name, h->sum);
 
printf("%-*.*s\n",
   graph_dotted_len, graph_dotted_len, graph_dotted_line);


[tip: sched/core] sched: Clean up scheduler_ipi()

2020-05-12 Thread tip-bot2 for Peter Zijlstra (Intel)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 90b5363acd4739769c3f38c1aff16171bd133e8c
Gitweb:
https://git.kernel.org/tip/90b5363acd4739769c3f38c1aff16171bd133e8c
Author:Peter Zijlstra (Intel) 
AuthorDate:Fri, 27 Mar 2020 11:44:56 +01:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 12 May 2020 17:10:48 +02:00

sched: Clean up scheduler_ipi()

The scheduler IPI has grown weird and wonderful over the years, time
for spring cleaning.

Move all the non-trivial stuff out of it and into a regular smp function
call IPI. This reduces scheduler_ipi() to most of its former NOP glory
and keeps the interrupt vector lean and mean.

Aside from that, avoiding the full irq_enter() in the x86 IPI
implementation is incorrect as scheduler_ipi() can be instrumented. To
work around that, scheduler_ipi() had an irq_enter/exit() hack for when
heavy work was pending. That hack is gone now.
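
The replacement pattern, extracted here as a hedged sketch (the my_*
names are made up; the real equivalents are rq_csd_init(), wake_csd and
nohz_csd in the hunks below), is a pre-initialised call_single_data_t
fired via smp_call_function_single_async() instead of a bare reschedule
IPI:

	static call_single_data_t my_csd;

	static void my_csd_func(void *info)
	{
		/* Runs on the target CPU from the smp-function-call IPI. */
	}

	static void my_csd_init(void)
	{
		my_csd.flags = 0;
		my_csd.func  = my_csd_func;
		my_csd.info  = NULL;
	}

	static void my_kick(int cpu)
	{
		smp_call_function_single_async(cpu, &my_csd);
	}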

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Alexandre Chartre 
Link: https://lkml.kernel.org/r/20200505134058.361859...@linutronix.de
---
 kernel/sched/core.c  | 64 ---
 kernel/sched/fair.c  |  5 +--
 kernel/sched/sched.h |  6 ++--
 3 files changed, 36 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b58efb1..cd2070d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -219,6 +219,13 @@ void update_rq_clock(struct rq *rq)
update_rq_clock_task(rq, delta);
 }
 
+static inline void
+rq_csd_init(struct rq *rq, call_single_data_t *csd, smp_call_func_t func)
+{
+   csd->flags = 0;
+   csd->func = func;
+   csd->info = rq;
+}
 
 #ifdef CONFIG_SCHED_HRTICK
 /*
@@ -314,16 +321,14 @@ void hrtick_start(struct rq *rq, u64 delay)
hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
  HRTIMER_MODE_REL_PINNED_HARD);
 }
+
 #endif /* CONFIG_SMP */
 
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
-   rq->hrtick_csd.flags = 0;
-   rq->hrtick_csd.func = __hrtick_start;
-   rq->hrtick_csd.info = rq;
+   rq_csd_init(rq, &rq->hrtick_csd, __hrtick_start);
 #endif
-
hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
rq->hrtick_timer.function = hrtick;
 }
@@ -650,6 +655,16 @@ static inline bool got_nohz_idle_kick(void)
return false;
 }
 
+static void nohz_csd_func(void *info)
+{
+   struct rq *rq = info;
+
+   if (got_nohz_idle_kick()) {
+   rq->idle_balance = 1;
+   raise_softirq_irqoff(SCHED_SOFTIRQ);
+   }
+}
+
 #else /* CONFIG_NO_HZ_COMMON */
 
 static inline bool got_nohz_idle_kick(void)
@@ -2292,6 +2307,11 @@ void sched_ttwu_pending(void)
rq_unlock_irqrestore(rq, &rf);
 }
 
+static void wake_csd_func(void *info)
+{
+   sched_ttwu_pending();
+}
+
 void scheduler_ipi(void)
 {
/*
@@ -2300,34 +2320,6 @@ void scheduler_ipi(void)
 * this IPI.
 */
preempt_fold_need_resched();
-
-   if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
-   return;
-
-   /*
-* Not all reschedule IPI handlers call irq_enter/irq_exit, since
-* traditionally all their work was done from the interrupt return
-* path. Now that we actually do some work, we need to make sure
-* we do call them.
-*
-* Some archs already do call them, luckily irq_enter/exit nest
-* properly.
-*
-* Arguably we should visit all archs and update all handlers,
-* however a fair share of IPIs are still resched only so this would
-* somewhat pessimize the simple resched case.
-*/
-   irq_enter();
-   sched_ttwu_pending();
-
-   /*
-* Check if someone kicked us for doing the nohz idle load balance.
-*/
-   if (unlikely(got_nohz_idle_kick())) {
-   this_rq()->idle_balance = 1;
-   raise_softirq_irqoff(SCHED_SOFTIRQ);
-   }
-   irq_exit();
 }
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
@@ -2336,9 +2328,9 @@ static void ttwu_queue_remote(struct task_struct *p, int 
cpu, int wake_flags)
 
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
-   if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+   if (llist_add(&p->wake_entry, &rq->wake_list)) {
if (!set_nr_if_polling(rq->idle))
-   smp_send_reschedule(cpu);
+   smp_call_function_single_async(cpu, &rq->wake_csd);
else
trace_sched_wake_idle_without_ipi(cpu);
}
@@ -6693,12 +6685,16 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
   

[tip:smp/hotplug] jump_label: Pull get_online_cpus() into generic code

2017-04-20 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  82947f31231157d8ab70fa8961f23fd3887a3327
Gitweb: http://git.kernel.org/tip/82947f31231157d8ab70fa8961f23fd3887a3327
Author: Peter Zijlstra (Intel) 
AuthorDate: Tue, 18 Apr 2017 19:05:03 +0200
Committer:  Thomas Gleixner 
CommitDate: Thu, 20 Apr 2017 13:08:57 +0200

jump_label: Pull get_online_cpus() into generic code

This change does two things:

- it moves the get_online_cpus() call into generic code, with the aim of
  later providing some static_key ops that avoid it.

- as a side effect it inverts the lock order between cpu_hotplug_lock and
  jump_label_mutex.
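
The resulting nesting, sketched for clarity (not literal code from the
patch): cpu_hotplug_lock is now always taken outside jump_label_mutex,
and the arch hooks only deal with text_mutex:

	void static_key_slow_inc(struct static_key *key)
	{
		get_online_cpus();	/* cpu_hotplug_lock, now outermost */
		jump_label_lock();	/* jump_label_mutex nests inside it */
		/* jump_label_update() -> arch code, which takes text_mutex only */
		jump_label_unlock();
		put_online_cpus();
	}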

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Cc: Sebastian Siewior 
Cc: Steven Rostedt 
Cc: jba...@akamai.com
Link: http://lkml.kernel.org/r/20170418103422.590118...@infradead.org

---
 arch/mips/kernel/jump_label.c  |  2 --
 arch/sparc/kernel/jump_label.c |  2 --
 arch/tile/kernel/jump_label.c  |  2 --
 arch/x86/kernel/jump_label.c   |  2 --
 kernel/jump_label.c| 14 ++
 5 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/mips/kernel/jump_label.c b/arch/mips/kernel/jump_label.c
index 3e586da..32e3168 100644
--- a/arch/mips/kernel/jump_label.c
+++ b/arch/mips/kernel/jump_label.c
@@ -58,7 +58,6 @@ void arch_jump_label_transform(struct jump_entry *e,
insn.word = 0; /* nop */
}
 
-   get_online_cpus();
mutex_lock(&text_mutex);
if (IS_ENABLED(CONFIG_CPU_MICROMIPS)) {
insn_p->halfword[0] = insn.word >> 16;
@@ -70,7 +69,6 @@ void arch_jump_label_transform(struct jump_entry *e,
   (unsigned long)insn_p + sizeof(*insn_p));
 
mutex_unlock(&text_mutex);
-   put_online_cpus();
 }
 
 #endif /* HAVE_JUMP_LABEL */
diff --git a/arch/sparc/kernel/jump_label.c b/arch/sparc/kernel/jump_label.c
index 07933b9..93adde1 100644
--- a/arch/sparc/kernel/jump_label.c
+++ b/arch/sparc/kernel/jump_label.c
@@ -41,12 +41,10 @@ void arch_jump_label_transform(struct jump_entry *entry,
val = 0x0100;
}
 
-   get_online_cpus();
mutex_lock(&text_mutex);
*insn = val;
flushi(insn);
mutex_unlock(&text_mutex);
-   put_online_cpus();
 }
 
 #endif
diff --git a/arch/tile/kernel/jump_label.c b/arch/tile/kernel/jump_label.c
index 07802d5..93931a4 100644
--- a/arch/tile/kernel/jump_label.c
+++ b/arch/tile/kernel/jump_label.c
@@ -45,14 +45,12 @@ static void __jump_label_transform(struct jump_entry *e,
 void arch_jump_label_transform(struct jump_entry *e,
enum jump_label_type type)
 {
-   get_online_cpus();
mutex_lock(&text_mutex);
 
__jump_label_transform(e, type);
flush_icache_range(e->code, e->code + sizeof(tilegx_bundle_bits));
 
mutex_unlock(&text_mutex);
-   put_online_cpus();
 }
 
 __init_or_module void arch_jump_label_transform_static(struct jump_entry *e,
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index c37bd0f..ab4f491 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -105,11 +105,9 @@ static void __jump_label_transform(struct jump_entry 
*entry,
 void arch_jump_label_transform(struct jump_entry *entry,
   enum jump_label_type type)
 {
-   get_online_cpus();
mutex_lock(&text_mutex);
__jump_label_transform(entry, type, NULL, 0);
mutex_unlock(&text_mutex);
-   put_online_cpus();
 }
 
 static enum {
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 6c9cb20..f3afe07 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef HAVE_JUMP_LABEL
 
@@ -124,6 +125,12 @@ void static_key_slow_inc(struct static_key *key)
return;
}
 
+   /*
+* A number of architectures need to synchronize I$ across
+* the all CPUs, for that to be serialized against CPU hot-plug
+* we need to avoid CPUs coming online.
+*/
+   get_online_cpus();
jump_label_lock();
if (atomic_read(&key->enabled) == 0) {
atomic_set(&key->enabled, -1);
@@ -133,6 +140,7 @@ void static_key_slow_inc(struct static_key *key)
atomic_inc(&key->enabled);
}
jump_label_unlock();
+   put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc);
 
@@ -146,6 +154,7 @@ static void __static_key_slow_dec(struct static_key *key,
 * returns is unbalanced, because all other static_key_slow_inc()
 * instances block while the update is in progress.
 */
+   get_online_cpus();
if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
WARN(atomic_read(&key->enabled) < 0,
 "jump label: negative count!\n");
@@ -159,6 +168,7

[tip:smp/hotplug] jump_label: Provide static_key_slow_inc_cpuslocked()

2017-04-20 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  f5efc6fad63f5533a6083e95286920d5753e52bf
Gitweb: http://git.kernel.org/tip/f5efc6fad63f5533a6083e95286920d5753e52bf
Author: Peter Zijlstra (Intel) 
AuthorDate: Tue, 18 Apr 2017 19:05:04 +0200
Committer:  Thomas Gleixner 
CommitDate: Thu, 20 Apr 2017 13:08:57 +0200

jump_label: Provide static_key_slow_inc_cpuslocked()

Provide static_key_slow_inc_cpuslocked(), a variant that doesn't take
cpu_hotplug_lock().
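
A hedged usage sketch (the key and caller are made up): the _cpuslocked
variant is meant for code that already holds the hotplug lock via
get_online_cpus() further up the call chain, so it must not try to take
it again:

	static struct static_key my_key = STATIC_KEY_INIT_FALSE;

	static void my_setup(void)
	{
		get_online_cpus();
		/* ... other work that must exclude CPU hotplug ... */
		static_key_slow_inc_cpuslocked(&my_key);
		put_online_cpus();
	}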

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Cc: Sebastian Siewior 
Cc: Steven Rostedt 
Cc: jba...@akamai.com
Link: http://lkml.kernel.org/r/20170418103422.636958...@infradead.org

---
 include/linux/jump_label.h |  3 +++
 kernel/jump_label.c| 21 +
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 2afd74b..7d07f0b 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -158,6 +158,7 @@ extern void arch_jump_label_transform_static(struct 
jump_entry *entry,
 enum jump_label_type type);
 extern int jump_label_text_reserved(void *start, void *end);
 extern void static_key_slow_inc(struct static_key *key);
+extern void static_key_slow_inc_cpuslocked(struct static_key *key);
 extern void static_key_slow_dec(struct static_key *key);
 extern void jump_label_apply_nops(struct module *mod);
 extern int static_key_count(struct static_key *key);
@@ -213,6 +214,8 @@ static inline void static_key_slow_inc(struct static_key 
*key)
atomic_inc(&key->enabled);
 }
 
+#define static_key_slow_inc_cpuslocked static_key_slow_inc
+
 static inline void static_key_slow_dec(struct static_key *key)
 {
STATIC_KEY_CHECK_USE();
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index f3afe07..308b12e 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -101,7 +101,7 @@ void static_key_disable(struct static_key *key)
 }
 EXPORT_SYMBOL_GPL(static_key_disable);
 
-void static_key_slow_inc(struct static_key *key)
+void __static_key_slow_inc(struct static_key *key)
 {
int v, v1;
 
@@ -130,7 +130,6 @@ void static_key_slow_inc(struct static_key *key)
 * the all CPUs, for that to be serialized against CPU hot-plug
 * we need to avoid CPUs coming online.
 */
-   get_online_cpus();
jump_label_lock();
if (atomic_read(&key->enabled) == 0) {
atomic_set(&key->enabled, -1);
@@ -140,10 +139,22 @@ void static_key_slow_inc(struct static_key *key)
atomic_inc(&key->enabled);
}
jump_label_unlock();
+}
+
+void static_key_slow_inc(struct static_key *key)
+{
+   get_online_cpus();
+   __static_key_slow_inc(key);
put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc);
 
+void static_key_slow_inc_cpuslocked(struct static_key *key)
+{
+   __static_key_slow_inc(key);
+}
+EXPORT_SYMBOL_GPL(static_key_slow_inc_cpuslocked);
+
 static void __static_key_slow_dec(struct static_key *key,
unsigned long rate_limit, struct delayed_work *work)
 {
@@ -154,7 +165,6 @@ static void __static_key_slow_dec(struct static_key *key,
 * returns is unbalanced, because all other static_key_slow_inc()
 * instances block while the update is in progress.
 */
-   get_online_cpus();
if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
WARN(atomic_read(&key->enabled) < 0,
 "jump label: negative count!\n");
@@ -168,20 +178,23 @@ static void __static_key_slow_dec(struct static_key *key,
jump_label_update(key);
}
jump_label_unlock();
-   put_online_cpus();
 }
 
 static void jump_label_update_timeout(struct work_struct *work)
 {
struct static_key_deferred *key =
container_of(work, struct static_key_deferred, work.work);
+   get_online_cpus();
__static_key_slow_dec(&key->key, 0, NULL);
+   put_online_cpus();
 }
 
 void static_key_slow_dec(struct static_key *key)
 {
STATIC_KEY_CHECK_USE();
+   get_online_cpus();
__static_key_slow_dec(key, 0, NULL);
+   put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec);
 


[tip:smp/hotplug] perf: Avoid cpu_hotplug_lock r-r recursion

2017-04-20 Thread tip-bot for Peter Zijlstra (Intel)
Commit-ID:  641693094ee1568502280f95900f374b2226b51d
Gitweb: http://git.kernel.org/tip/641693094ee1568502280f95900f374b2226b51d
Author: Peter Zijlstra (Intel) 
AuthorDate: Tue, 18 Apr 2017 19:05:05 +0200
Committer:  Thomas Gleixner 
CommitDate: Thu, 20 Apr 2017 13:08:57 +0200

perf: Avoid cpu_hotplug_lock r-r recursion

There are two call-sites where using static_key results in recursing on the
cpu_hotplug_lock.

Use the hotplug locked version of static_key_slow_inc().

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Cc: Sebastian Siewior 
Cc: Steven Rostedt 
Cc: jba...@akamai.com
Link: http://lkml.kernel.org/r/20170418103422.687248...@infradead.org

---
 kernel/events/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 634dd95..8aa3063 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7653,7 +7653,7 @@ static int perf_swevent_init(struct perf_event *event)
if (err)
return err;
 
-   static_key_slow_inc(&perf_swevent_enabled[event_id]);
+   static_key_slow_inc_cpuslocked(&perf_swevent_enabled[event_id]);
event->destroy = sw_perf_event_destroy;
}
 
@@ -9160,7 +9160,7 @@ static void account_event(struct perf_event *event)
 
mutex_lock(&perf_sched_mutex);
if (!atomic_read(&perf_sched_count)) {
-   static_branch_enable(&perf_sched_events);
+   static_key_slow_inc_cpuslocked(&perf_sched_events.key);
/*
 * Guarantee that all CPUs observe they key change and
 * call the perf scheduling hooks before proceeding to