Re: [PATCH RFC 0/4] x86/fixmap: Unify FIXADDR_TOP

2023-06-10 Thread Thomas Gleixner
On Thu, Jun 08 2023 at 17:33, Hou Wenlong wrote:
> On Wed, Jun 07, 2023 at 08:49:15PM +0800, Dave Hansen wrote:
>> What problems does this patch set solve?  How might that solution be
>> visible to end users?  Why is this problem important to you?
>
> We want to build the kernel as PIE and allow the kernel image area,
> including the fixmap area, to be placed at any virtual address.

You are still failing to tell us why you want that and which problem
this solves. Just the fact that you want to do something is not an
argument.

> We have also implemented a PV Linux guest based on PIE, which can be
> used in software virtualization similar to Lguest. PIE makes the guest
> kernel share the host kernel space similar to a normal userspace
> process.  Additionally, we are considering whether it is possible to
> use PIE and PVOPS to implement a user-mode kernel.

That solves what?

Thanks,

tglx


Re: [PATCH RFC 0/4] x86/fixmap: Unify FIXADDR_TOP

2023-06-07 Thread Thomas Gleixner
On Mon, May 15 2023 at 16:19, Hou Wenlong wrote:

> This patchset unifies FIXADDR_TOP as a variable for x86, allowing the
> fixmap area to be movable and relocated with the kernel image in the
> x86/PIE patchset [0]. This enables the kernel image to be relocated in
> the top 512G of the address space.

What for? What's the use case?

Please provide a proper argument why this is generally useful and
important.

Thanks,

tglx


Re: [PATCH v2 01/11] genirq/affinity: Export irq_create_affinity_masks()

2023-02-13 Thread Thomas Gleixner
On Mon, Feb 13 2023 at 22:50, Yongji Xie wrote:
> On Mon, Feb 13, 2023 at 8:00 PM Michael S. Tsirkin  wrote:
> I can try to split irq_create_affinity_masks() into a common part and
> an irq specific part, and move the common part to a common dir such as
> /lib and export it. Then we can use the common part to build a new API
> for usage.

  https://lore.kernel.org/all/20221227022905.352674-1-ming@redhat.com/

Thanks,

tglx


Re: [PATCH v2] x86/hotplug: Do not put offline vCPUs in mwait idle state

2023-01-19 Thread Thomas Gleixner
On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
> "Srivatsa S. Bhat"  wrote:
>> Fix this by preventing the use of mwait idle state in the vCPU offline
>> play_dead() path for any hypervisor, even if mwait support is
>> available.
>
> if mwait is enabled, it's very likely the guest has cpuidle
> enabled and is using the same mwait as well. So exiting early from
> mwait_play_dead() might just punt the workflow down:
>   native_play_dead()
> ...
> mwait_play_dead();
> if (cpuidle_play_dead())   <- possible mwait here
> hlt_play_dead();
>
> and it will end up in mwait again and only if that fails
> it will go HLT route and maybe transition to VMM.

Good point.

> Instead of a workaround on the guest side, shouldn't the hypervisor
> force a VMEXIT on the vCPU being unplugged when it's actually
> hot-unplugging it? (e.g. QEMU kicks the vCPU out of guest context
> when removing it, among other things)

For a pure guest side CPU unplug operation:

guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online

the hypervisor is not involved at all. The vCPU is not removed in that
case.

So to ensure that this ends up in HLT something like the below is
required.

Note, the removal of the comment after mwait_play_dead() is intentional
because the comment is completely bogus. Not having MWAIT is not a
failure. But that wants to be a separate patch.

Thanks,

tglx
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 55cad72715d9..3f1f20f71ec5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1833,7 +1833,10 @@ void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
-	mwait_play_dead();	/* Only returns on failure */
+	if (this_cpu_has(X86_FEATURE_HYPERVISOR))
+		hlt_play_dead();
+
+	mwait_play_dead();
 	if (cpuidle_play_dead())
 		hlt_play_dead();
 }


  


Re: [PATCH v2] x86/vmware: use unsigned integer for shifting

2022-05-20 Thread Thomas Gleixner
On Fri, May 20 2022 at 19:39, Shreenidhi Shedi wrote:

> From: Shreenidhi Shedi 
>
> From: Shreenidhi Shedi 

Can you please decide which of your personalities wrote that patch?

> Shifting signed 32-bit value by 31 bits is implementation-defined
> behaviour. Using unsigned is better option for this.

Better option? There are no options. It's either correct or not. Please
be precise and technical in your wording.
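For illustration only (not part of the patch): BIT() from <linux/bits.h>
expands to an unsigned long shift, so the sign bit never comes into
play. A minimal sketch of the difference, using bit 31 as the example:

    #include <linux/bits.h>

    int bad            = 1 << 31;   /* signed int shifted into the sign
                                       bit: undefined behaviour per C99
                                       6.5.7 */
    unsigned long good = BIT(31);   /* expands to 1UL << 31: well-defined */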

> Fixes: 4cca6ea04d31 ("x86/apic: Allow x2apic without IR on VMware platform")
>
> Signed-off-by: Shreenidhi Shedi 

Please keep the tags together. This extra new line is pointless and
makes the maintainer do extra work to remove it.

Documentation/process/* has all the relevant directives for
you. Following them is not an option. It's mandatory.

> @@ -476,8 +477,8 @@ static bool __init vmware_legacy_x2apic_available(void)
>  {
>   uint32_t eax, ebx, ecx, edx;
>   VMWARE_CMD(GETVCPU_INFO, eax, ebx, ecx, edx);
> - return (eax & (1 << VMWARE_CMD_VCPU_RESERVED)) == 0 &&
> -(eax & (1 << VMWARE_CMD_LEGACY_X2APIC)) != 0;
> + return !(eax & BIT(VMWARE_CMD_VCPU_RESERVED)) &&
> + (eax & BIT(VMWARE_CMD_LEGACY_X2APIC))

Testing your changes before submission is not optional either. How is
this supposed to compile?

Thanks,

tglx


Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock

2022-04-27 Thread Thomas Gleixner
On Tue, Apr 26 2022 at 09:51, Adrian Hunter wrote:
> On 25/04/22 20:05, Thomas Gleixner wrote:
>> On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
>>> On 25/04/22 12:32, Thomas Gleixner wrote:
>>>> It's hilarious that we still cling to this pvclock abomination, while
>>>> we happily expose TSC deadline timer to the guest. TSC virt scaling was
>>>> implemented in hardware for a reason.
>>>
>>> So you are talking about changing the VMX TSC offset on every
>>> VM-Entry to try to hide the time jumps when the VM is scheduled out?
>>> Or neglect that and just let the time jumps happen?
>>>
>>> If changing the VMX TSC offset, how can TSC be kept consistent
>>> between each VCPU, i.e. wouldn't that mean each VCPU has to have the
>>> same VMX TSC offset?
>> 
>> Obviously so. That's the only thing which makes sense, no?
>
> [ Sending this again, because I notice I messed up the email "From" ]
>
> But wouldn't that mean changing all the VCPUs' VMX TSC offsets at the
> same time, which means when none are currently executing? How could
> that be done?

Why would you change TSC offset after the point where a VM is started
and why would it be different per vCPU?

Time is global and time moves on when a vCPU is scheduled out. Anything
else is bonkers, really. If the hypervisor tries to screw with that then
how does the guest do timekeeping in a consistent way?

CLOCK_REALTIME = CLOCK_MONOTONIC + offset

That offset changes when something sets the clock, i.e. clock_settime(),
settimeofday() or adjtimex() in case that NTP cannot compensate or for
the beloved leap seconds adjustment. At any other time the offset is
constant.
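To illustrate with invented numbers:

    CLOCK_MONOTONIC = 1000.000000000
    CLOCK_REALTIME  = 1700000000.000000000
    => offset       = 1699999000.000000000

    clock_settime(CLOCK_REALTIME, {1700000100, 0});

    => offset       = 1699999100.000000000

CLOCK_MONOTONIC is unaffected by that and keeps incrementing.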

CLOCK_MONOTONIC is derived from the underlying clocksource which is
expected to increment with constant frequency and that has to be
consistent across _all_ vCPUs of a particular VM.

So how would a hypervisor 'hide' scheduled out time w/o screwing up
timekeeping completely?

The guest TSC which is based on the host TSC is:

guestTSC = offset + hostTSC * factor;

If you make the offset different between guest vCPUs, then timekeeping
in the guest is screwed.
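Spelled out (illustration, not from the thread): with per-vCPU offsets

    guestTSC_0 = offset_0 + hostTSC * factor
    guestTSC_1 = offset_1 + hostTSC * factor

a thread which reads the TSC on vCPU0 and, after being migrated, on
vCPU1 observes a jump of

    guestTSC_1 - guestTSC_0 = offset_1 - offset_0

at one and the same host time, i.e. the clocksource is no longer
monotonic from the guest's point of view.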

The whole point of that paravirt clock was to handle migration between
hosts which did not have the VMCS TSC scaling/offset mechanism. The CPUs
which did not have that went EOL at least 10 years ago.

So what are you concerned about?

Thanks,

tglx


Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock

2022-04-25 Thread Thomas Gleixner
On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
> On 25/04/22 12:32, Thomas Gleixner wrote:
>> It's hilarious that we still cling to this pvclock abomination, while
>> we happily expose TSC deadline timer to the guest. TSC virt scaling was
>> implemented in hardware for a reason.
>
> So you are talking about changing the VMX TSC offset on every VM-Entry
> to try to hide the time jumps when the VM is scheduled out? Or neglect
> that and just let the time jumps happen?
>
> If changing the VMX TSC offset, how can TSC be kept consistent between
> each VCPU, i.e. wouldn't that mean each VCPU has to have the same VMX
> TSC offset?

Obviously so. That's the only thing which makes sense, no?

Thanks,

tglx


Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock

2022-04-25 Thread Thomas Gleixner
On Mon, Apr 25 2022 at 08:30, Adrian Hunter wrote:
> On 14/03/22 13:50, Adrian Hunter wrote:
>>> TSC offsetting may also be a problem. The VMCS TSC offset must be
>>> discoverable by the guest. This can be done via the TSC_ADJUST MSR.
>>> The offset in the VMCS and the guest TSC_ADJUST MSR must always be
>>> equivalent, i.e. a write to TSC_ADJUST in the guest must be reflected
>>> in the VMCS and any changes to the offset in the VMCS must be
>>> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method
>>> must be invented to communicate an arbitrary VMCS TSC offset to the
>>> guest.
>>>
>> 
>> In my view it is reasonable for perf to support TSC as a perf clock
>> in any case because:
>>  a) it allows users to work entirely with TSC if they wish
>>  b) other kernel performance / debug facilities like ftrace
>> already support TSC
>>  c) the patches to add TSC support are relatively small and
>> straight-forward
>> 
>> May we have support for TSC as a perf event clock?
>
> Any update on this?

If TSC is reliable on the host, then there is absolutely no reason not
to use it in the guest all over the place. And that is independent of
exposing ART to the guest.

So why do we need extra solutions for PT and perf, ftrace and whatever?

Can we just fix the underlying problem and make the hypervisor tell the
guest that TSC is stable, reliable and good to use?

Then everything else just falls into place and using TSC is a
substantial performance gain in general. Just look at the VDSO
implementation of __arch_get_hw_counter() -> vread_pvclock():

Instead of just reading the TSC, this needs to take a nested seqcount,
read TSC and do yet another mult/shift, which makes clock_gettime() ~20%
slower than necessary.
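Roughly, the two read paths compare like this (simplified sketch; the
real vread_pvclock() additionally checks PVCLOCK_TSC_STABLE_BIT, names
follow arch/x86/include/asm/pvclock.h):

    #include <asm/pvclock.h>
    #include <asm/tsc.h>

    /* TSC clocksource path: a single serialized TSC read. */
    static u64 read_hw_counter_tsc(void)
    {
            return rdtsc_ordered();
    }

    /* pvclock path: seqcount-style loop over the per-vCPU pvti page
     * plus an extra mult/shift on every single read. */
    static u64 read_hw_counter_pvclock(const struct pvclock_vcpu_time_info *pvti)
    {
            u32 version;
            u64 ns;

            do {
                    version = pvclock_read_begin(pvti);
                    ns = pvti->system_time +
                         pvclock_scale_delta(rdtsc_ordered() - pvti->tsc_timestamp,
                                             pvti->tsc_to_system_mul,
                                             pvti->tsc_shift);
            } while (pvclock_read_retry(pvti, version));

            return ns;
    }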

It's hilarious that we still cling to this pvclock abomination, while
we happily expose TSC deadline timer to the guest. TSC virt scaling was
implemented in hardware for a reason.

Thanks,

tglx


Re:

2022-03-29 Thread Thomas Gleixner
On Tue, Mar 29 2022 at 10:37, Michael S. Tsirkin wrote:
> On Tue, Mar 29, 2022 at 10:35:21AM +0200, Thomas Gleixner wrote:
> We are trying to fix the driver since at the moment it does not
> have the dev->ok flag at all.
>
> And I suspect virtio is not alone in that.
> So it would have been nice if there was a standard flag
> replacing the driver-specific dev->ok above, and ideally
> would also handle the case of an interrupt triggering
> too early by deferring the interrupt until the flag is set.
>
> And in fact, it does kind of exist: IRQF_NO_AUTOEN, and you would call
> enable_irq instead of dev->ok = true, except
> - it doesn't work with affinity managed IRQs
> - it does not work with shared IRQs
>
> So using dev->ok as you propose above seems better at this point.

Unless there is a big enough number of drivers which could make use of
a generic mechanism for that.

>> If any driver does this in the wrong order, then the driver is
>> broken.
> 
> I agree, however:
> $ git grep synchronize_irq `git grep -l request_irq drivers/net/`|wc -l
> 113
> $ git grep -l request_irq drivers/net/|wc -l
> 397
>
> I suspect there are more drivers which in theory need the
> synchronize_irq dance but in practice do not execute it.

That really depends on when the driver requests the interrupt, when
it actually enables the interrupt in the device itself and how the
interrupt service routine works.

So just doing that grep dance does not tell much. You really have to do
a case by case analysis.

Thanks,

tglx



Re:

2022-03-29 Thread Thomas Gleixner
On Mon, Mar 28 2022 at 06:40, Michael S. Tsirkin wrote:
> On Mon, Mar 28, 2022 at 02:18:22PM +0800, Jason Wang wrote:
>> > > So I think we might talk different issues:
>> > >
>> > > 1) Whether request_irq() commits the previous setups, I think the
>> > > answer is yes, since the spin_unlock of desc->lock (release) can
>> > > guarantee this, though there seems to be no documentation around
>> > > request_irq() to say this.
>> > >
>> > > And I can see at least drivers/video/fbdev/omap2/omapfb/dss/dispc.c is
>> > > using smp_wmb() before the request_irq().

That's a completely bogus example, especially as there is not a single
smp_rmb() which pairs with the smp_wmb().

>> > > And even if write is ordered we still need read to be ordered to be
>> > > paired with that.
>
> IMO it synchronizes with the CPU to which irq is
> delivered. Otherwise basically all drivers would be broken,
> wouldn't they be?
> I don't know whether it's correct on all platforms, but if not
> we need to fix request_irq.

There is nothing to fix:

request_irq()
   raw_spin_lock_irq(desc->lock);   // ACQUIRE
   
   raw_spin_unlock_irq(desc->lock); // RELEASE

interrupt()
   raw_spin_lock(desc->lock);   // ACQUIRE
   set status to IN_PROGRESS
   raw_spin_unlock(desc->lock); // RELEASE
   invoke handler()

So anything which the driver set up _before_ request_irq() is visible to
the interrupt handler. No?

>> What happens if an interrupt is raised in the middle like:
>> 
>> smp_store_release(dev->irq_soft_enabled, true)
>> IRQ handler
>> synchornize_irq()

This is bogus. The obvious order of things is:

dev->ok = false;
request_irq();

moar_setup();
synchronize_irq();  // ACQUIRE + RELEASE
dev->ok = true;

The reverse operation on teardown:

dev->ok = false;
synchronize_irq();  // ACQUIRE + RELEASE

teardown();

So in both cases a simple check in the handler is sufficient:

handler()
if (!dev->ok)
return;

I'm not understanding what you folks are trying to "fix" here. If any
driver does this in the wrong order, then the driver is broken.

Sure, you can do the same with:

dev->ok = false;
request_irq();
moar_setup();
smp_wmb();
dev->ok = true;

for the price of a smp_rmb() in the interrupt handler:

handler()
if (!dev->ok)
return;
smp_rmb();

but that only works correctly for the setup case and not for
teardown.
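Putting the pieces together as a compact sketch (foo_*, moar_setup(),
handle_device() and teardown() are placeholders, not an existing
driver):

    struct foo_dev {
            bool ok;
            int irq;
    };

    static irqreturn_t foo_interrupt(int irq, void *data)
    {
            struct foo_dev *dev = data;

            if (!dev->ok)
                    return IRQ_NONE;
            /* Everything written before dev->ok was set to true is
             * visible here, courtesy of the synchronize_irq() in
             * foo_setup(). */
            handle_device(dev);
            return IRQ_HANDLED;
    }

    static int foo_setup(struct foo_dev *dev)
    {
            int ret;

            dev->ok = false;
            ret = request_irq(dev->irq, foo_interrupt, IRQF_SHARED, "foo", dev);
            if (ret)
                    return ret;
            moar_setup(dev);
            synchronize_irq(dev->irq);      /* ACQUIRE + RELEASE */
            dev->ok = true;
            return 0;
    }

    static void foo_teardown(struct foo_dev *dev)
    {
            dev->ok = false;
            synchronize_irq(dev->irq);      /* handler cannot observe ok == true anymore */
            teardown(dev);
            free_irq(dev->irq, dev);
    }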

Thanks,

tglx


Re: Which tree for paravirt related patches?

2021-11-04 Thread Thomas Gleixner
Srivatsa,

On Thu, Nov 04 2021 at 12:09, Srivatsa S. Bhat wrote:
> On a related note, I'll be stepping in soon to assist (in place of
> Deep) as a co-maintainer of the PARAVIRT_OPS interface. I had the same
> query about which tree would be best for patches to the paravirt-ops
> code, so I'm glad to see that it got clarified on this thread.

Welcome to the club.

> I'll also be taking over the maintainership of the VMware hypervisor
> interface. Looking at the git logs, I believe those patches have
> also been handled via the tip tree; so would it be okay to add the
> x86 ML and the tip tree to the VMware hypervisor interface entry too
> in the MAINTAINERS file?

We've routed them through tip, yes. So yes, that's fine to have a
separate entry in the maintainers file which has you and x...@kernel.org
plus the tip tree mentioned.

Thanks,

tglx




Re: Which tree for paravirt related patches?

2021-11-04 Thread Thomas Gleixner
On Thu, Nov 04 2021 at 10:17, Thomas Gleixner wrote:

CC+ x86, peterz

> Juergen,
>
> On Thu, Nov 04 2021 at 06:53, Juergen Gross wrote:
>
>> A recent patch modifying the core paravirt-ops functionality is
>> highlighting some missing MAINTAINERS information for PARAVIRT_OPS:
>> there is no information which tree is to be used for taking those
>> patches per default. In the past this was mostly handled by the tip
>> tree, and I think this is fine.
>>
>> X86 maintainers, are you fine with me modifying the PARAVIRT_OPS entry
>> to add the x86 ML and the tip tree? This way such patches will be
>> noticed by you and can be handled accordingly.
>
> Sure.
>
>> An alternative would be to let me carry those patches through the Xen
>> tree, but in lots of those patches some core x86 files are being touched
>> and I think the tip tree is better suited for paravirt handling.
>
> Fair enough.
>
>> And please, could you take a look at:
>>
>> https://lore.kernel.org/virtualization/b8192e8a-13ef-6ac6-6364-8ba58992c...@suse.com/
>>
>> This patch was the one making me notice the problem.
>
> Will do.
>
> Thanks,
>
> Thomas


Re: Which tree for paravirt related patches?

2021-11-04 Thread Thomas Gleixner
Juergen,

On Thu, Nov 04 2021 at 06:53, Juergen Gross wrote:

> A recent patch modifying the core paravirt-ops functionality is
> highlighting some missing MAINTAINERS information for PARAVIRT_OPS:
> there is no information which tree is to be used for taking those
> patches per default. In the past this was mostly handled by the tip
> tree, and I think this is fine.
>
> X86 maintainers, are you fine with me modifying the PARAVIRT_OPS entry
> to add the x86 ML and the tip tree? This way such patches will be
> noticed by you and can be handled accordingly.

Sure.

> An alternative would be to let me carry those patches through the Xen
> tree, but in lots of those patches some core x86 files are being touched
> and I think the tip tree is better suited for paravirt handling.

Fair enough.

> And please, could you take a look at:
>
> https://lore.kernel.org/virtualization/b8192e8a-13ef-6ac6-6364-8ba58992c...@suse.com/
>
> This patch was the one making me notice the problem.

Will do.

Thanks,

Thomas


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-17 Thread Thomas Gleixner
On Mon, Oct 18 2021 at 02:55, Thomas Gleixner wrote:
> On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
>> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
>> others) in init functions that also register drivers (thanks Elena for 
>> the number)
>
> These numbers are completely useless simply because they are based on
> nonsensical criteria. See:
>
>   https://lore.kernel.org/r/87r1cj2uad.ffs@tglx
>
>> My point is just that the ecosystem of devices that Linux supports is 
>> messy enough that there are legitimate exceptions from the "First IO 
>> only in probe call only" rule.
>
> Your point is based on your outright refusal to actually do a proper
> analysis and your outright refusal to help fix the real problems.
>
> All you have provided so far is handwaving based on a completely useless
> analysis.
>
> Sure, your goal is to get this TDX problem solved, but it's not going to
> be solved by:
>
>   1) Providing a nonsensical analysis
>
>   2) Using #1 as an argument to hack some half-baked interfaces into the
>  kernel which allow you to tick off your checkbox and then leave the
>  resulting mess for others to clean up.
>  
> Try again when you have factual data to back up your claims and factual
> arguments which prove that the problem can't be fixed otherwise.
>
> I might be repeating myself, but kernel development works this way:
>
>   1) Hack your private POC - Yay!
>
>   2) Sit down and think hard about the problems you identified in step
>  #1. Do a thorough analysis.
>   
>   3) Come up with a sensible integration plan.
>
>   4) Do the necessary grunt work of cleanups all over the place
>
>   5) Add sensible infrastructure which is understandable for the bulk
>  of kernel/driver developers
>
>   6) Let your feature fall in place
>
> and not in the way you are insisting on:
>
>   1) Hack your private POC - Yay!
>
>   2) Define that this is the only way to do it and try to shove it down
>  the throat of everyone.
>
>   3) Getting told that this is not the way it works
>
>   4) Insist on it forever and blame the grumpy maintainers who are just
>  not understanding the great value of your approach.
>
>   5) Go back to #2
>
> You should know that already, but I have no problem to give that lecture
> to you over and over again. I probably should create a form letter.
>
> And no, you can bitch about me as much as you want. These are not my
> personal rules and personal pet pieves. These are rules Linus cares
> about very much and aside of that they just reflect common sense.
>
>   The kernel is a common good and not the dumping ground for your personal
>   brain waste.
>
>   The kernel does not serve Intel. Quite the contrary Intel depends on
>   the kernel to work nicely with its hardware. Ergo, Intel should have
>   a vested interest to serve the kernel and take responsibility for it
>   as a whole. And so should you as an Intel employee.
>
> Just dumping your next half-baked workaround does not cut it, especially
> not when it is not backed up by sensible arguments.
>
> Please try again, but not before you have something substantial to back
> up your claims.

That said, I can't resist the urge to say a few words to the responsible
senior and management people at Intel in this context:

I surely know that a lot of Intel people claim that their lack of
progress is _only_ because Thomas is hard to work with and Thomas wants
unreasonable changes to their code, which I could perceive as an abuse of
myself for the purpose of self-deception. TBH, I don't give a damn.

Let me ask a few questions instead:

  - Is it unreasonable to expect that argumentations are based on facts
and proper analysis?

  - Is it unreasonable to expect a proper integration of a new feature?

  - Does it take unreasonable effort to do a proper design?

  - Is it unreasonable to ask that the necessary cleanups are done
upfront?

If anyone of the responsible people at Intel thinks so, then they should
speak up now and tell me in public and into my face what's so
unreasonable about that.

Thanks,

Thomas


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-17 Thread Thomas Gleixner
Andi,

On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
> On 10/9/2021 1:39 PM, Dan Williams wrote:
>> I agree with you and Greg here. If a driver is accessing hardware
>> resources outside of the bind lifetime of one of the devices it
>> supports, and in a way that neither modrobe-policy nor
>> device-authorization -policy infrastructure can block, that sounds
>> like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
> others) in init functions that also register drivers (thanks Elena for 
> the number)

These numbers are completely useless simply because they are based on
nonsensical criteria. See:

  https://lore.kernel.org/r/87r1cj2uad.ffs@tglx

> My point is just that the ecosystem of devices that Linux supports is 
> messy enough that there are legitimate exceptions from the "First IO 
> only in probe call only" rule.

Your point is based on your outright refusal to actually do a proper
analysis and your outright refusal to help fix the real problems.

All you have provided so far is handwaving based on a completely useless
analysis.

Sure, your goal is to get this TDX problem solved, but it's not going to
be solved by:

  1) Providing a nonsensical analysis

  2) Using #1 as an argument to hack some half-baked interfaces into the
 kernel which allow you to tick off your checkbox and then leave the
 resulting mess for others to clean up.
 
Try again when you have factual data to back up your claims and factual
arguments which prove that the problem can't be fixed otherwise.

I might be repeating myself, but kernel development works this way:

  1) Hack your private POC - Yay!

  2) Sit down and think hard about the problems you identified in step
 #1. Do a thorough analysis.
  
  3) Come up with a sensible integration plan.

  4) Do the necessary grunt work of cleanups all over the place

  5) Add sensible infrastructure which is understandable for the bulk
 of kernel/driver developers

  6) Let your feature fall in place

and not in the way you are insisting on:

  1) Hack your private POC - Yay!

  2) Define that this is the only way to do it and try to shove it down
 the throat of everyone.

  3) Getting told that this is not the way it works

  4) Insist on it forever and blame the grumpy maintainers who are just
 not understanding the great value of your approach.

  5) Go back to #2

You should know that already, but I have no problem to give that lecture
to you over and over again. I probably should create a form letter.

And no, you can bitch about me as much as you want. These are not my
personal rules and personal pet peeves. These are rules Linus cares
about very much and aside of that they just reflect common sense.

  The kernel is a common good and not the dumping ground for your personal
  brain waste.

  The kernel does not serve Intel. Quite the contrary Intel depends on
  the kernel to work nicely with its hardware. Ergo, Intel should have
  a vested interest to serve the kernel and take responsibility for it
  as a whole. And so should you as an Intel employee.

Just dumping your next half-baked workaround does not cut it, especially
not when it is not backed up by sensible arguments.

Please try again, but not before you have something substantial to back
up your claims.

Thanks,

Thomas


RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-17 Thread Thomas Gleixner
Elena,

On Thu, Oct 14 2021 at 06:32, Elena Reshetova wrote:
>> On Tue, Oct 12, 2021 at 06:36:16PM +, Reshetova, Elena wrote:
> It does not make any difference really for the content of drivers/*:
> it gives 408 __init style functions doing IO (.probe &
> builtin/module_platform_driver_probe excluded) for 5.15 with
> allmodconfig:
>
> ['doc200x_ident_chip',
> 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> 'ubi_gluebi_init', 'ubiblock_init'
> 'ubi_init', 'mtd_stresstest_init',

All of this is MTD and can just be disabled wholesale.

Aside of that, most of these depend on either platform devices or device
tree enumerations which are not ever available on X86.

> 'probe_acpi_namespace_devices',

> 'amd_iommu_init_pci', 'state_next',
> 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
> 'late_iommu_features_init', 'detect_ivrs',
> 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
> 'intel_cleanup_irq_remapping', 'detect_intel_iommu',
> 'parse_ioapics_under_ir', 'si_domain_init',
> 'intel_iommu_init', 'dmar_table_init',
> 'enable_drhd_fault_handling',
> 'check_tylersburg_isoch', 

None of this is reachable because the initial detection which is ACPI
table based will fail for TDX. If not, it's a guest firmware problem.

> 'fb_console_init', 'xenbus_probe_backend_init',
> 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
> 'balloon_init',

XEN, that's relevant because magically the TDX guest will assume that it
is a XEN instance?

> 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> 'sun5i_setup_clockevent',
> 'mt7621_clk_init',
> 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> 'meson6_timer_init', 'atcpit100_timer_init',
> 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init',
> 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
> 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
> 'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
> 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
> 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
> 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
> 'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
> 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
> 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
> 'sh73a0_cpg_clocks_init', 'cpg_div6_register',
> 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
> 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
> 'cpg_sd_clk_register', 'r7s9210_update_clk_table',
> 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
> 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
> 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
> 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
> 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
> 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
> 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
> 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
> 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
> 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
> 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
> 'of_fixed_mmio_clk_setup',
> 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',

ARM based drivers are initialized on x86 in which way?

> 'hv_init_tsc_clocksource', 'hv_init_clocksource',

HyperV. See XEN

> 'skx_init',
> 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> 'pnd2_init', 'edac_init', 'adummy_init',

EDAC has already hypervisor checks

> 'init_acpi_pm_clocksource',

Requires ACPI table entry or command line override

> 'intel_rng_mod_init',

Has an old style PCI table which is searched via pci_get_device(). Could
do with a cleanup which converts it to proper PCI probing.
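A sketch of what such a conversion could look like (device ID and the
foo_rng_* names are made up; the point is replacing the pci_get_device()
scan in an unconditional __init function with a probe callback that
only runs when a matching device exists):

    static const struct pci_device_id foo_rng_ids[] = {
            { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x1234) },    /* made-up ID */
            { }
    };
    MODULE_DEVICE_TABLE(pci, foo_rng_ids);

    static int foo_rng_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
            /* Invoked by the PCI core only when the device is actually
             * present. No IO from an unconditional __init function. */
            return 0;
    }

    static struct pci_driver foo_rng_driver = {
            .name           = "foo-rng",
            .id_table       = foo_rng_ids,
            .probe          = foo_rng_probe,
    };
    module_pci_driver(foo_rng_driver);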



So I stop here, because it would be way simpler to have the file names
but so far I could identify all of it from the top of my head.

So what are you trying to tell me? That you found tons of ioremaps in
__init functions which are completely irrelevant.

Please stop making arguments based on completely nonsensical data. It
took me less than 5 minutes to eliminate more than 50% of that list and
I'm pretty sure that I could have eliminated the bulk of the rest as
well.

The fact that a large part of this is ARM only, the fa

Re: [PATCH 7/9] virtio-pci: harden INTX interrupts

2021-09-14 Thread Thomas Gleixner
On Tue, Sep 14 2021 at 13:03, Peter Zijlstra wrote:
> On Mon, Sep 13, 2021 at 11:36:24PM +0200, Thomas Gleixner wrote:
> Here you rely on the UNLOCK+LOCK pattern because we have two adjacent
> critical sections (or rather, the same twice), which provides RCtso
> ordering, which is sufficient to make the below store:
>
>> 
>> intx_soft_enabled = true;
>
> a RELEASE. Still, I would suggest writing it at least using
> WRITE_ONCE() with a comment on top.

Right. forgot about that.

>   disable_irq();
>   /*
>    * The above disable_irq() provides TSO ordering and as such
>    * promotes the below store to store-release.
>    */
>   WRITE_ONCE(intx_soft_enabled, true);
>   enable_irq();
>
>> In this case synchronize_irq() prevents the subsequent store to
>> intx_soft_enabled from leaking into the __disable_irq(desc) section,
>> which in turn makes it impossible for an interrupt handler to observe
>> intx_soft_enabled == true before the prerequisites which precede the
>> call to disable_irq() are visible.
>> 
>> Of course the memory ordering wizards might disagree, but if they do,
>> then we have a massive chase of ordering problems vs. similar constructs
>> all over the tree ahead of us.
>
> Your case, UNLOCK s + LOCK s, is fully documented to provide RCtso
> ordering. The more general case of: UNLOCK r + LOCK s, will shortly
> appear in documentation near you. Meaning we can forget about the
> details and blanket-state that any UNLOCK followed by a LOCK (on the same
> CPU) will provide TSO ordering.

I think we also should document the disable/synchronize_irq() scheme
somewhere.

Thanks,

tglx



Re: [PATCH 6/9] virtio_pci: harden MSI-X interrupts

2021-09-14 Thread Thomas Gleixner
On Mon, Sep 13 2021 at 16:54, Michael S. Tsirkin wrote:
> On Mon, Sep 13, 2021 at 09:38:30PM +0200, Thomas Gleixner wrote:
>> and disable it again
>> before reset() is invoked. That's a question of general robustness and
>> not really a question of trusted hypervisors and encrypted guests.
>
> We can do this for some MSIX interrupts, sure. Not for shared
> interrupts though.

But you have to make sure that the handler does not run before and after
the defined points. And that's even more important for shared
interrupts, because the line can be raised at any point in time via the
other devices which share it.

Thanks,

tglx


Re: [PATCH 6/9] virtio_pci: harden MSI-X interrupts

2021-09-13 Thread Thomas Gleixner
On Mon, Sep 13 2021 at 16:54, Michael S. Tsirkin wrote:

> On Mon, Sep 13, 2021 at 09:38:30PM +0200, Thomas Gleixner wrote:
>> On Mon, Sep 13 2021 at 15:07, Jason Wang wrote:
>> > On Mon, Sep 13, 2021 at 2:50 PM Michael S. Tsirkin  wrote:
>> >> > But doesn't "irq is disabled" basically mean "we told the hypervisor
>> >> > to disable the irq"?  What exactly prevents the hypervisor from
>> >> > sending the irq even if the guest thinks it disabled it?
>> >>
>> >> More generally, can't we for example blow away the
>> >> indir_desc array that we use to keep the ctx pointers?
>> >> Won't that be enough?
>> >
>> > I'm not sure how it is related to the indirect descriptor but an
>> > example is that all the current driver will assume:
>> >
>> > 1) the interrupt won't be raised before virtio_device_ready()
>> > 2) the interrupt won't be raised after reset()
>> 
>> If that assumption exists, then you better keep the interrupt line
>> disabled until virtio_device_ready() has completed
>
> Started, not completed. The device is allowed to send config
> interrupts right after the DRIVER_OK status is set by
> virtio_device_ready().

Whatever:

 * Define the exact point from which the driver is able to handle the
   interrupt, and put the enable after that point

 * Define the exact point from which the driver is unable to handle
   the interrupt, and put the disable before that point

The above is blurry.

>> and disable it again
>> before reset() is invoked. That's a question of general robustness and
>> not really a question of trusted hypervisors and encrypted guests.
>
> We can do this for some MSIX interrupts, sure. Not for shared
> interrupts though.

See my reply to the next patch. The problem is the same:

 * Define the exact point from which the driver is able to handle the
   interrupt, and allow the handler to proceed after that point

 * Define the exact point from which the driver is unable to handle
   the interrupt, and ensure that the handler refuses to proceed before
   that point

Same story, just a different mechanism.
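Expressed as a sketch (names modeled on the patch set, not taken
verbatim from it):

    /* MSI-X: hard-disable the vector outside the window in which the
     * driver can handle it. */
    static void vp_bringup_msix(struct virtio_device *vdev, int irq)
    {
            disable_irq(irq);
            virtio_device_ready(vdev);  /* handler may run from here on */
            enable_irq(irq);
    }

    /* Shared INTX: the line can fire at any time because of the other
     * devices sharing it, so the handler has to deny service itself. */
    static irqreturn_t vp_interrupt(int irq, void *opaque)
    {
            struct virtio_pci_device *vp_dev = opaque;

            if (!READ_ONCE(vp_dev->intx_soft_enabled))
                    return IRQ_NONE;
            /* ... actual interrupt handling ... */
            return IRQ_HANDLED;
    }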

Thanks,

tglx


Re: [PATCH 7/9] virtio-pci: harden INTX interrupts

2021-09-13 Thread Thomas Gleixner
On Mon, Sep 13 2021 at 18:01, Michael S. Tsirkin wrote:
> On Mon, Sep 13, 2021 at 11:36:24PM +0200, Thomas Gleixner wrote:
>> From the interrupt perspective the sequence:
>> 
>> disable_irq();
>> vp_dev->intx_soft_enabled = true;
>> enable_irq();
>> 
>> is perfectly fine as well. Any interrupt arriving during the disabled
>> section will be reraised on enable_irq() in hardware because it's a
>> level interrupt. Any resulting failure is either a hardware or a
>> hypervisor bug.
>
> yes but it's a shared interrupt. what happens if multiple callers do
> this in parallel?

Nothing, as each caller is serialized vs. itself and its own interrupt
handler.

Thanks,

tglx


Re: [PATCH 7/9] virtio-pci: harden INTX interrupts

2021-09-13 Thread Thomas Gleixner
Jason,

On Mon, Sep 13 2021 at 13:53, Jason Wang wrote:
> This patch tries to make sure the virtio interrupt handler for INTX
> won't be called after a reset and before virtio_device_ready(). We
> can't use IRQF_NO_AUTOEN since we're using shared interrupt
> (IRQF_SHARED). So this patch tracks the INTX enabling status in a new
> intx_soft_enabled variable and toggles it in
> vp_disable/enable_vectors(). The INTX interrupt handler will check
> intx_soft_enabled before processing the actual interrupt.

Ah, there it is :)

Cc'ed our memory ordering wizards as I might be wrong as usual.

> - if (vp_dev->intx_enabled)
> + if (vp_dev->intx_enabled) {
> + vp_dev->intx_soft_enabled = false;
> + /* ensure the vp_interrupt see this intx_soft_enabled value */
> + smp_wmb();
>   synchronize_irq(vp_dev->pci_dev->irq);

As you are synchronizing the interrupt here anyway, what is the value of
the barrier?

vp_dev->intx_soft_enabled = false;
synchronize_irq(vp_dev->pci_dev->irq);

is sufficient because of:

synchronize_irq()
   do {
      raw_spin_lock(desc->lock);
      in_progress = check_inprogress(desc);
      raw_spin_unlock(desc->lock);
   } while (in_progress);

raw_spin_lock() has ACQUIRE semantics, so the store to intx_soft_enabled
can complete after the lock has been acquired, which is uninteresting.

raw_spin_unlock() has RELEASE semantics, so the store to intx_soft_enabled
has to be completed before the unlock completes.

So if the interrupt is in flight, then it might or might not see
intx_soft_enabled == false. But that's true for your barrier construct
as well.

The important part is that any interrupt for this line arriving after
synchronize_irq() has completed is guaranteed to see intx_soft_enabled
== false.

That is what you want to achieve, right?

>   for (i = 0; i < vp_dev->msix_vectors; ++i)
>   disable_irq(pci_irq_vector(vp_dev->pci_dev, i));
> @@ -43,8 +47,12 @@ void vp_enable_vectors(struct virtio_device *vdev)
>   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>   int i;
>  
> - if (vp_dev->intx_enabled)
> + if (vp_dev->intx_enabled) {
> + vp_dev->intx_soft_enabled = true;
> + /* ensure the vp_interrupt see this intx_soft_enabled value */
> + smp_wmb();

For the enable case the barrier is pointless vs. intx_soft_enabled

CPU 0                                   CPU 1

interrupt                               vp_enable_vectors()
  vp_interrupt()
    if (!vp_dev->intx_soft_enabled)
        return IRQ_NONE;
                                        vp_dev->intx_soft_enabled = true;

IOW, the concurrent interrupt might or might not see the store. That's
not a problem for legacy PCI interrupts. If it did not see the store and
the interrupt originated from that device then it will account it as one
spurious interrupt which will get raised again because those interrupts
are level triggered and nothing acknowledged it at the device level.

Now, what's more interesting is that it has to be guaranteed that the
interrupt which observes

vp_dev->intx_soft_enabled == true

also observes all preceding stores, i.e. those which make the interrupt
handler capable of handling the interrupt.

That's the real problem and for that your barrier is at the wrong place
because you want to make sure that those stores are visible before the
store to intx_soft_enabled becomes visible, i.e. this should be:


/* Ensure that all preceding stores are visible before intx_soft_enabled */
smp_wmb();
vp_dev->intx_soft_enabled = true;

Now Michael is not really enthusiastic about the barrier in the
interrupt handler hot path, which is understandable.

As device startup does not happen often, it's sensible to do the
following:

disable_irq();
vp_dev->intx_soft_enabled = true;
enable_irq();

because:

disable_irq()
  synchronize_irq()

acts as a barrier for the preceding stores:

disable_irq()
   raw_spin_lock(desc->lock);
   __disable_irq(desc);
   raw_spin_unlock(desc->lock);

   synchronize_irq()
      do {
         raw_spin_lock(desc->lock);
         in_progress = check_inprogress(desc);
         raw_spin_unlock(desc->lock);
      } while (in_progress);

intx_soft_enabled = true;

enable_irq();

In this case synchronize_irq() prevents the subsequent store to
intx_soft_enabled from leaking into the __disable_irq(desc) section,
which in turn makes it impossible for an interrupt handler to observe
intx_soft_enabled == true before the prerequisites which precede the
call to disable_irq() are visible.

Of course the memory ordering wizards might disagree, but if they do,
then we have a massive chase of ordering problems vs. similar constructs
all over the tree ahead of us.

Re: [PATCH 6/9] virtio_pci: harden MSI-X interrupts

2021-09-13 Thread Thomas Gleixner
On Mon, Sep 13 2021 at 15:07, Jason Wang wrote:
> On Mon, Sep 13, 2021 at 2:50 PM Michael S. Tsirkin  wrote:
>> > But doesn't "irq is disabled" basically mean "we told the hypervisor
>> > to disable the irq"?  What exactly prevents the hypervisor from
>> > sending the irq even if the guest thinks it disabled it?
>>
>> More generally, can't we for example blow away the
>> indir_desc array that we use to keep the ctx pointers?
>> Won't that be enough?
>
> I'm not sure how it is related to the indirect descriptor but an
> example is that all the current driver will assume:
>
> 1) the interrupt won't be raised before virtio_device_ready()
> 2) the interrupt won't be raised after reset()

If that assumption exists, then you better keep the interrupt line
disabled until virtio_device_ready() has completed and disable it again
before reset() is invoked. That's a question of general robustness and
not really a question of trusted hypervisors and encrypted guests.

>> > > > > > > +void vp_disable_vectors(struct virtio_device *vdev)
>> > > > > > >  {
>> > > > > > >   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> > > > > > >   int i;
>> > > > > > > @@ -34,7 +34,20 @@ void vp_synchronize_vectors(struct virtio_device *vdev)
>> > > > > > >   synchronize_irq(vp_dev->pci_dev->irq);

Don't you want the same change for non-MSI interrupts?

Thanks,

tglx


Re: [PATCH V2 5/6] virtio: add one field into virtio_device for recording if device uses managed irq

2021-07-07 Thread Thomas Gleixner
On Tue, Jul 06 2021 at 07:42, Christoph Hellwig wrote:
> On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
>> blk-mq needs to know if the device uses managed irq, so add one field
>> to virtio_device for recording if device uses managed irq.
>> 
>> If the driver uses managed irq, this flag has to be set so it can be
>> passed to blk-mq.
>
> I don't think all this boilerplate code make a whole lot of sense.
> I think we need to record this information deep down in the irq code by
> setting a flag in struct device only if pci_alloc_irq_vectors_affinity
> atually managed to allocate multiple vectors and the PCI_IRQ_AFFINITY
> flag was set.  Then blk-mq can look at that flag, and also check that
> more than one queue is in used and work based on that.

Ack.
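
A sketch of the direction described above (the has_managed_irqs field
and the foo_* helpers are assumptions; nothing of this exists in the
tree):

    /* PCI side: the one place which knows whether the vectors are
     * affinity-managed. */
    static int foo_init_vectors(struct pci_dev *pdev, unsigned int max_vecs,
                                struct irq_affinity *affd)
    {
            int nvecs = pci_alloc_irq_vectors_affinity(pdev, 1, max_vecs,
                                                       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                                       affd);
            if (nvecs > 1)
                    pdev->dev.has_managed_irqs = true;  /* assumed new field */
            return nvecs;
    }

    /* blk-mq side: act on the flag only when more than one queue is
     * actually in use. */
    static bool foo_use_managed_mapping(struct device *dev,
                                        unsigned int nr_hw_queues)
    {
            return nr_hw_queues > 1 && dev->has_managed_irqs;
    }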


[patch 5/7] drm/nouveau/device: Replace io_mapping_map_atomic_wc()

2021-03-04 Thread Thomas Gleixner
From: Thomas Gleixner 

Neither fbmem_peek() nor fbmem_poke() requires disabling pagefaults and
preemption as a side effect of io_mapping_map_atomic_wc().

Use io_mapping_map_local_wc() instead.

Signed-off-by: Thomas Gleixner 
Cc: Ben Skeggs 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: dri-de...@lists.freedesktop.org
Cc: nouv...@lists.freedesktop.org
---
 drivers/gpu/drm/nouveau/nvkm/subdev/devinit/fbmem.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/drivers/gpu/drm/nouveau/nvkm/subdev/devinit/fbmem.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/devinit/fbmem.h
@@ -60,19 +60,19 @@ fbmem_fini(struct io_mapping *fb)
 static inline u32
 fbmem_peek(struct io_mapping *fb, u32 off)
 {
-   u8 __iomem *p = io_mapping_map_atomic_wc(fb, off & PAGE_MASK);
+   u8 __iomem *p = io_mapping_map_local_wc(fb, off & PAGE_MASK);
u32 val = ioread32(p + (off & ~PAGE_MASK));
-   io_mapping_unmap_atomic(p);
+   io_mapping_unmap_local(p);
return val;
 }
 
 static inline void
 fbmem_poke(struct io_mapping *fb, u32 off, u32 val)
 {
-   u8 __iomem *p = io_mapping_map_atomic_wc(fb, off & PAGE_MASK);
+   u8 __iomem *p = io_mapping_map_local_wc(fb, off & PAGE_MASK);
iowrite32(val, p + (off & ~PAGE_MASK));
wmb();
-   io_mapping_unmap_atomic(p);
+   io_mapping_unmap_local(p);
 }
 
 static inline bool




[patch 2/7] drm/vmgfx: Replace kmap_atomic()

2021-03-04 Thread Thomas Gleixner
From: Thomas Gleixner 

There is no reason to disable pagefaults and preemption as a side effect of
kmap_atomic_prot().

Use kmap_local_page_prot() instead and document the reasoning for the
mapping usage with the given pgprot.

Remove the NULL pointer check for the map. These functions return a valid
address for valid pages and the return was bogus anyway as it would have
left preemption and pagefaults disabled.

Signed-off-by: Thomas Gleixner 
Cc: VMware Graphics 
Cc: Roland Scheidegger 
Cc: Zack Rusin 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: dri-de...@lists.freedesktop.org
---
 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c |   30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

--- a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
@@ -375,12 +375,12 @@ static int vmw_bo_cpu_blit_line(struct v
copy_size = min_t(u32, copy_size, PAGE_SIZE - src_page_offset);
 
if (unmap_src) {
-   kunmap_atomic(d->src_addr);
+   kunmap_local(d->src_addr);
d->src_addr = NULL;
}
 
if (unmap_dst) {
-   kunmap_atomic(d->dst_addr);
+   kunmap_local(d->dst_addr);
d->dst_addr = NULL;
}
 
@@ -388,12 +388,8 @@ static int vmw_bo_cpu_blit_line(struct v
if (WARN_ON_ONCE(dst_page >= d->dst_num_pages))
return -EINVAL;
 
-   d->dst_addr =
-   kmap_atomic_prot(d->dst_pages[dst_page],
-d->dst_prot);
-   if (!d->dst_addr)
-   return -ENOMEM;
-
+   d->dst_addr = kmap_local_page_prot(d->dst_pages[dst_page],
+  d->dst_prot);
d->mapped_dst = dst_page;
}
 
@@ -401,12 +397,8 @@ static int vmw_bo_cpu_blit_line(struct v
if (WARN_ON_ONCE(src_page >= d->src_num_pages))
return -EINVAL;
 
-   d->src_addr =
-   kmap_atomic_prot(d->src_pages[src_page],
-d->src_prot);
-   if (!d->src_addr)
-   return -ENOMEM;
-
+   d->src_addr = kmap_local_page_prot(d->src_pages[src_page],
+  d->src_prot);
d->mapped_src = src_page;
}
diff->do_cpy(diff, d->dst_addr + dst_page_offset,
@@ -436,8 +428,10 @@ static int vmw_bo_cpu_blit_line(struct v
  *
  * Performs a CPU blit from one buffer object to another avoiding a full
  * bo vmap which may exhaust- or fragment vmalloc space.
- * On supported architectures (x86), we're using kmap_atomic which avoids
- * cross-processor TLB- and cache flushes and may, on non-HIGHMEM systems
+ *
+ * On supported architectures (x86), we're using kmap_local_prot() which
+ * avoids cross-processor TLB- and cache flushes. kmap_local_prot() will
+ * either map a highmem page with the proper pgprot on HIGHMEM=y systems or
  * reference already set-up mappings.
  *
  * Neither of the buffer objects may be placed in PCI memory
@@ -500,9 +494,9 @@ int vmw_bo_cpu_blit(struct ttm_buffer_ob
}
 out:
if (d.src_addr)
-   kunmap_atomic(d.src_addr);
+   kunmap_local(d.src_addr);
if (d.dst_addr)
-   kunmap_atomic(d.dst_addr);
+   kunmap_local(d.dst_addr);
 
return ret;
 }




[patch 6/7] drm/i915: Replace io_mapping_map_atomic_wc()

2021-03-04 Thread Thomas Gleixner
From: Thomas Gleixner 

None of these mapping requires the side effect of disabling pagefaults and
preemption.

Use io_mapping_map_local_wc() instead, and clean up gtt_user_read() and
gtt_user_write() to use a plain copy_from_user() as the local maps are not
disabling pagefaults.

Signed-off-by: Thomas Gleixner 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Chris Wilson 
Cc: intel-...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c |7 +---
 drivers/gpu/drm/i915/i915_gem.c|   40 -
 drivers/gpu/drm/i915/selftests/i915_gem.c  |4 +-
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c  |8 ++---
 4 files changed, 22 insertions(+), 37 deletions(-)

--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1080,7 +1080,7 @@ static void reloc_cache_reset(struct rel
struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 
intel_gt_flush_ggtt_writes(ggtt->vm.gt);
-   io_mapping_unmap_atomic((void __iomem *)vaddr);
+   io_mapping_unmap_local((void __iomem *)vaddr);
 
if (drm_mm_node_allocated(&cache->node)) {
ggtt->vm.clear_range(&ggtt->vm,
@@ -1146,7 +1146,7 @@ static void *reloc_iomap(struct drm_i915
 
if (cache->vaddr) {
intel_gt_flush_ggtt_writes(ggtt->vm.gt);
-   io_mapping_unmap_atomic((void __force __iomem *) unmask_page(cache->vaddr));
+   io_mapping_unmap_local((void __force __iomem *) unmask_page(cache->vaddr));
} else {
struct i915_vma *vma;
int err;
@@ -1194,8 +1194,7 @@ static void *reloc_iomap(struct drm_i915
offset += page << PAGE_SHIFT;
}
 
-   vaddr = (void __force *)io_mapping_map_atomic_wc(&ggtt->iomap,
-        offset);
+   vaddr = (void __force *)io_mapping_map_local_wc(&ggtt->iomap, offset);
cache->page = page;
cache->vaddr = (unsigned long)vaddr;
 
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -253,22 +253,15 @@ gtt_user_read(struct io_mapping *mapping
  char __user *user_data, int length)
 {
void __iomem *vaddr;
-   unsigned long unwritten;
+   bool fail = false;
 
/* We can use the cpu mem copy function because this is X86. */
-   vaddr = io_mapping_map_atomic_wc(mapping, base);
-   unwritten = __copy_to_user_inatomic(user_data,
-   (void __force *)vaddr + offset,
-   length);
-   io_mapping_unmap_atomic(vaddr);
-   if (unwritten) {
-   vaddr = io_mapping_map_wc(mapping, base, PAGE_SIZE);
-   unwritten = copy_to_user(user_data,
-(void __force *)vaddr + offset,
-length);
-   io_mapping_unmap(vaddr);
-   }
-   return unwritten;
+   vaddr = io_mapping_map_local_wc(mapping, base);
+   if (copy_to_user(user_data, (void __force *)vaddr + offset, length))
+   fail = true;
+   io_mapping_unmap_local(vaddr);
+
+   return fail;
 }
 
 static int
@@ -437,21 +430,14 @@ ggtt_write(struct io_mapping *mapping,
   char __user *user_data, int length)
 {
void __iomem *vaddr;
-   unsigned long unwritten;
+   bool fail = false;
 
/* We can use the cpu mem copy function because this is X86. */
-   vaddr = io_mapping_map_atomic_wc(mapping, base);
-   unwritten = __copy_from_user_inatomic_nocache((void __force *)vaddr + offset,
- user_data, length);
-   io_mapping_unmap_atomic(vaddr);
-   if (unwritten) {
-   vaddr = io_mapping_map_wc(mapping, base, PAGE_SIZE);
-   unwritten = copy_from_user((void __force *)vaddr + offset,
-  user_data, length);
-   io_mapping_unmap(vaddr);
-   }
-
-   return unwritten;
+   vaddr = io_mapping_map_local_wc(mapping, base);
+   if (copy_from_user((void __force *)vaddr + offset, user_data, length))
+   fail = true;
+   io_mapping_unmap_local(vaddr);
+   return fail;
 }
 
 /**
--- a/drivers/gpu/drm/i915/selftests/i915_gem.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem.c
@@ -58,12 +58,12 @@ static void trash_stolen(struct drm_i915
 
ggtt->vm.insert_page(&ggtt->vm, dma, slot, I915_CACHE_NONE, 0);
 
-   s = io_mapping_map_atomic_wc(&ggtt->iomap, slot);
+   s = io_mapping_map_local_wc(&ggtt->iomap, slot);
for (x = 0; x < PAG

[patch 7/7] io-mapping: Remove io_mapping_map_atomic_wc()

2021-03-04 Thread Thomas Gleixner
From: Thomas Gleixner 

No more users. Get rid of it and remove the traces in documentation.

Signed-off-by: Thomas Gleixner 
Cc: Andrew Morton 
Cc: linux...@kvack.org
---
 Documentation/driver-api/io-mapping.rst |   22 +---
 include/linux/io-mapping.h  |   42 +---
 2 files changed, 9 insertions(+), 55 deletions(-)

--- a/Documentation/driver-api/io-mapping.rst
+++ b/Documentation/driver-api/io-mapping.rst
@@ -21,19 +21,15 @@ mappable, while 'size' indicates how lar
 enable. Both are in bytes.
 
 This _wc variant provides a mapping which may only be used with
-io_mapping_map_atomic_wc(), io_mapping_map_local_wc() or
-io_mapping_map_wc().
+io_mapping_map_local_wc() or io_mapping_map_wc().
 
 With this mapping object, individual pages can be mapped either temporarily
 or long term, depending on the requirements. Of course, temporary maps are
-more efficient. They come in two flavours::
+more efficient.
 
void *io_mapping_map_local_wc(struct io_mapping *mapping,
  unsigned long offset)
 
-   void *io_mapping_map_atomic_wc(struct io_mapping *mapping,
-  unsigned long offset)
-
 'offset' is the offset within the defined mapping region.  Accessing
 addresses beyond the region specified in the creation function yields
 undefined results. Using an offset which is not page aligned yields an
@@ -50,9 +46,6 @@ io_mapping_map_local_wc() has a side eff
 migration to make the mapping code work. No caller can rely on this side
 effect.
 
-io_mapping_map_atomic_wc() has the side effect of disabling preemption and
-pagefaults. Don't use in new code. Use io_mapping_map_local_wc() instead.
-
 Nested mappings need to be undone in reverse order because the mapping
 code uses a stack for keeping track of them::
 
@@ -65,11 +58,10 @@ Nested mappings need to be undone in rev
 The mappings are released with::
 
void io_mapping_unmap_local(void *vaddr)
-   void io_mapping_unmap_atomic(void *vaddr)
 
-'vaddr' must be the value returned by the last io_mapping_map_local_wc() or
-io_mapping_map_atomic_wc() call. This unmaps the specified mapping and
-undoes the side effects of the mapping functions.
+'vaddr' must be the value returned by the last io_mapping_map_local_wc()
+call. This unmaps the specified mapping and undoes eventual side effects of
+the mapping function.
 
 If you need to sleep while holding a mapping, you can use the regular
 variant, although this may be significantly slower::
@@ -77,8 +69,8 @@ If you need to sleep while holding a map
void *io_mapping_map_wc(struct io_mapping *mapping,
unsigned long offset)
 
-This works like io_mapping_map_atomic/local_wc() except it has no side
-effects and the pointer is globaly visible.
+This works like io_mapping_map_local_wc() except it has no side effects and
+the pointer is globally visible.
 
 The mappings are released with::
 
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -60,28 +60,7 @@ io_mapping_fini(struct io_mapping *mappi
iomap_free(mapping->base, mapping->size);
 }
 
-/* Atomic map/unmap */
-static inline void __iomem *
-io_mapping_map_atomic_wc(struct io_mapping *mapping,
-unsigned long offset)
-{
-   resource_size_t phys_addr;
-
-   BUG_ON(offset >= mapping->size);
-   phys_addr = mapping->base + offset;
-   preempt_disable();
-   pagefault_disable();
-   return __iomap_local_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
-}
-
-static inline void
-io_mapping_unmap_atomic(void __iomem *vaddr)
-{
-   kunmap_local_indexed((void __force *)vaddr);
-   pagefault_enable();
-   preempt_enable();
-}
-
+/* Temporary mappings which are only valid in the current context */
 static inline void __iomem *
 io_mapping_map_local_wc(struct io_mapping *mapping, unsigned long offset)
 {
@@ -163,24 +142,7 @@ io_mapping_unmap(void __iomem *vaddr)
 {
 }
 
-/* Atomic map/unmap */
-static inline void __iomem *
-io_mapping_map_atomic_wc(struct io_mapping *mapping,
-unsigned long offset)
-{
-   preempt_disable();
-   pagefault_disable();
-   return io_mapping_map_wc(mapping, offset, PAGE_SIZE);
-}
-
-static inline void
-io_mapping_unmap_atomic(void __iomem *vaddr)
-{
-   io_mapping_unmap(vaddr);
-   pagefault_enable();
-   preempt_enable();
-}
-
+/* Temporary mappings which are only valid in the current context */
 static inline void __iomem *
 io_mapping_map_local_wc(struct io_mapping *mapping, unsigned long offset)
 {

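For illustration, a minimal sketch of the stack-based nesting rule the
documentation above describes; only the io_mapping_*() calls are the real
API, the function and offsets are made up:

#include <linux/io-mapping.h>

/* Nested local mappings must be undone in reverse order (stack based) */
static void example_two_pages(struct io_mapping *mapping,
			      unsigned long off_a, unsigned long off_b)
{
	void __iomem *a = io_mapping_map_local_wc(mapping, off_a);
	void __iomem *b = io_mapping_map_local_wc(mapping, off_b);

	/* ... access both mappings ... */

	io_mapping_unmap_local(b);	/* last mapped, first unmapped */
	io_mapping_unmap_local(a);
}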


[patch 4/7] drm/qxl: Replace io_mapping_map_atomic_wc()

2021-03-04 Thread Thomas Gleixner
From: Thomas Gleixner 

None of these mappings requires the side effect of disabling pagefaults and
preemption.

Use io_mapping_map_local_wc() instead, rename the related functions
accordingly and clean up qxl_process_single_command() to use a plain
copy_from_user() as the local maps are not disabling pagefaults.

Signed-off-by: Thomas Gleixner 
Cc: David Airlie 
Cc: Gerd Hoffmann 
Cc: Daniel Vetter 
Cc: virtualization@lists.linux-foundation.org
Cc: spice-de...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
---
 drivers/gpu/drm/qxl/qxl_image.c   |   18 +-
 drivers/gpu/drm/qxl/qxl_ioctl.c   |   27 +--
 drivers/gpu/drm/qxl/qxl_object.c  |   12 ++--
 drivers/gpu/drm/qxl/qxl_object.h  |4 ++--
 drivers/gpu/drm/qxl/qxl_release.c |4 ++--
 5 files changed, 32 insertions(+), 33 deletions(-)

--- a/drivers/gpu/drm/qxl/qxl_image.c
+++ b/drivers/gpu/drm/qxl/qxl_image.c
@@ -124,12 +124,12 @@ qxl_image_init_helper(struct qxl_device
  wrong (check the bitmaps are sent correctly
  first) */
 
-   ptr = qxl_bo_kmap_atomic_page(qdev, chunk_bo, 0);
+   ptr = qxl_bo_kmap_local_page(qdev, chunk_bo, 0);
chunk = ptr;
chunk->data_size = height * chunk_stride;
chunk->prev_chunk = 0;
chunk->next_chunk = 0;
-   qxl_bo_kunmap_atomic_page(qdev, chunk_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, chunk_bo, ptr);
 
{
void *k_data, *i_data;
@@ -143,7 +143,7 @@ qxl_image_init_helper(struct qxl_device
i_data = (void *)data;
 
while (remain > 0) {
-   ptr = qxl_bo_kmap_atomic_page(qdev, chunk_bo, page << PAGE_SHIFT);
+   ptr = qxl_bo_kmap_local_page(qdev, chunk_bo, page << PAGE_SHIFT);
 
if (page == 0) {
chunk = ptr;
@@ -157,7 +157,7 @@ qxl_image_init_helper(struct qxl_device
 
memcpy(k_data, i_data, size);
 
-   qxl_bo_kunmap_atomic_page(qdev, chunk_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, chunk_bo, ptr);
i_data += size;
remain -= size;
page++;
@@ -175,10 +175,10 @@ qxl_image_init_helper(struct qxl_device
page_offset = offset_in_page(out_offset);
size = min((int)(PAGE_SIZE - page_offset), remain);

-   ptr = qxl_bo_kmap_atomic_page(qdev, chunk_bo, page_base);
+   ptr = qxl_bo_kmap_local_page(qdev, chunk_bo, page_base);
k_data = ptr + page_offset;
memcpy(k_data, i_data, size);
-   qxl_bo_kunmap_atomic_page(qdev, chunk_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, chunk_bo, ptr);
remain -= size;
i_data += size;
out_offset += size;
@@ -189,7 +189,7 @@ qxl_image_init_helper(struct qxl_device
qxl_bo_kunmap(chunk_bo);
 
image_bo = dimage->bo;
-   ptr = qxl_bo_kmap_atomic_page(qdev, image_bo, 0);
+   ptr = qxl_bo_kmap_local_page(qdev, image_bo, 0);
image = ptr;
 
image->descriptor.id = 0;
@@ -212,7 +212,7 @@ qxl_image_init_helper(struct qxl_device
break;
default:
DRM_ERROR("unsupported image bit depth\n");
-   qxl_bo_kunmap_atomic_page(qdev, image_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, image_bo, ptr);
return -EINVAL;
}
image->u.bitmap.flags = QXL_BITMAP_TOP_DOWN;
@@ -222,7 +222,7 @@ qxl_image_init_helper(struct qxl_device
image->u.bitmap.palette = 0;
image->u.bitmap.data = qxl_bo_physical_address(qdev, chunk_bo, 0);
 
-   qxl_bo_kunmap_atomic_page(qdev, image_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, image_bo, ptr);
 
return 0;
 }
--- a/drivers/gpu/drm/qxl/qxl_ioctl.c
+++ b/drivers/gpu/drm/qxl/qxl_ioctl.c
@@ -89,11 +89,11 @@ apply_reloc(struct qxl_device *qdev, str
 {
void *reloc_page;
 
-   reloc_page = qxl_bo_kmap_atomic_page(qdev, info->dst_bo, info->dst_offset & PAGE_MASK);
+   reloc_page = qxl_bo_kmap_local_page(qdev, info->dst_bo, info->dst_offset & PAGE_MASK);
	*(uint64_t *)(reloc_page + (info->dst_offset & ~PAGE_MASK)) = qxl_bo_physical_address(qdev,
										  info->src_bo,

[patch 1/7] drm/ttm: Replace kmap_atomic() usage

2021-03-04 Thread Thomas Gleixner
From: Thomas Gleixner 

There is no reason to disable pagefaults and preemption as a side effect of
kmap_atomic_prot().

Use kmap_local_page_prot() instead and document the reasoning for the
mapping usage with the given pgprot.

Remove the NULL pointer check for the map. These functions return a valid
address for valid pages and the return was bogus anyway as it would have
left preemption and pagefaults disabled.

Signed-off-by: Thomas Gleixner 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: dri-de...@lists.freedesktop.org
---
 drivers/gpu/drm/ttm/ttm_bo_util.c |   20 
 1 file changed, 12 insertions(+), 8 deletions(-)

--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -181,13 +181,15 @@ static int ttm_copy_io_ttm_page(struct t
return -ENOMEM;
 
src = (void *)((unsigned long)src + (page << PAGE_SHIFT));
-   dst = kmap_atomic_prot(d, prot);
-   if (!dst)
-   return -ENOMEM;
+   /*
+* Ensure that a highmem page is mapped with the correct
+* pgprot. For non highmem the mapping is already there.
+*/
+   dst = kmap_local_page_prot(d, prot);
 
memcpy_fromio(dst, src, PAGE_SIZE);
 
-   kunmap_atomic(dst);
+   kunmap_local(dst);
 
return 0;
 }
@@ -203,13 +205,15 @@ static int ttm_copy_ttm_io_page(struct t
return -ENOMEM;
 
dst = (void *)((unsigned long)dst + (page << PAGE_SHIFT));
-   src = kmap_atomic_prot(s, prot);
-   if (!src)
-   return -ENOMEM;
+   /*
+* Ensure that a highmem page is mapped with the correct
+* pgprot. For non highmem the mapping is already there.
+*/
+   src = kmap_local_page_prot(s, prot);
 
memcpy_toio(dst, src, PAGE_SIZE);
 
-   kunmap_atomic(src);
+   kunmap_local(src);
 
return 0;
 }


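For context, the converted pattern in isolation - a sketch assuming a valid
destination page and an ioremapped source, mirroring the hunks above:

#include <linux/highmem.h>
#include <linux/io.h>

static void copy_io_to_page(const void __iomem *src, struct page *d,
			    pgprot_t prot)
{
	/* A valid page always maps; no NULL check required */
	void *dst = kmap_local_page_prot(d, prot);

	memcpy_fromio(dst, src, PAGE_SIZE);
	kunmap_local(dst);
}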


[patch 3/7] highmem: Remove kmap_atomic_prot()

2021-03-04 Thread Thomas Gleixner
From: Thomas Gleixner 

No more users.

Signed-off-by: Thomas Gleixner 
Cc: Andrew Morton 
Cc: linux...@kvack.org
---
 include/linux/highmem-internal.h |   14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -88,16 +88,11 @@ static inline void __kunmap_local(void *
kunmap_local_indexed(vaddr);
 }
 
-static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+static inline void *kmap_atomic(struct page *page)
 {
preempt_disable();
pagefault_disable();
-   return __kmap_local_page_prot(page, prot);
-}
-
-static inline void *kmap_atomic(struct page *page)
-{
-   return kmap_atomic_prot(page, kmap_prot);
+   return __kmap_local_page_prot(page, kmap_prot);
 }
 
 static inline void *kmap_atomic_pfn(unsigned long pfn)
@@ -184,11 +179,6 @@ static inline void *kmap_atomic(struct p
return page_address(page);
 }
 
-static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
-{
-   return kmap_atomic(page);
-}
-
 static inline void *kmap_atomic_pfn(unsigned long pfn)
 {
return kmap_atomic(pfn_to_page(pfn));

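kmap_atomic() itself survives this cleanup and keeps its historic side
effects. A made-up caller to show what the implicit contract still is:

#include <linux/highmem.h>
#include <linux/string.h>

static void zero_page_atomic(struct page *page)
{
	void *addr = kmap_atomic(page);

	/* Pagefaults and preemption are disabled here - no sleeping */
	memset(addr, 0, PAGE_SIZE);
	kunmap_atomic(addr);
}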


[patch 0/7] drm, highmem: Cleanup io/kmap_atomic*() usage

2021-03-04 Thread Thomas Gleixner
None of the DRM usage sites of temporary mappings requires the side
effects of io/kmap_atomic(), i.e. preemption and pagefault disable.

Replace them with the io/kmap_local() variants, simplify the
copy_to/from_user() error handling and remove the atomic variants.

Thanks,

tglx
---
 Documentation/driver-api/io-mapping.rst |   22 +++---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c  |7 +--
 drivers/gpu/drm/i915/i915_gem.c |   40 ++-
 drivers/gpu/drm/i915/selftests/i915_gem.c   |4 -
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c   |8 +--
 drivers/gpu/drm/nouveau/nvkm/subdev/devinit/fbmem.h |8 +--
 drivers/gpu/drm/qxl/qxl_image.c |   18 
 drivers/gpu/drm/qxl/qxl_ioctl.c |   27 ++--
 drivers/gpu/drm/qxl/qxl_object.c|   12 ++---
 drivers/gpu/drm/qxl/qxl_object.h|4 -
 drivers/gpu/drm/qxl/qxl_release.c   |4 -
 drivers/gpu/drm/ttm/ttm_bo_util.c   |   20 +
 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c|   30 +-
 include/linux/highmem-internal.h|   14 --
 include/linux/io-mapping.h  |   42 
 15 files changed, 93 insertions(+), 167 deletions(-)
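To make the "simplify the copy_to/from_user() error handling" point
concrete: with local maps pagefaults stay enabled, so the old
atomic-copy-plus-fallback dance collapses into a plain copy. A condensed
sketch of what the i915 conversion in this series boils down to (names
shortened, not the literal patch):

#include <linux/io-mapping.h>
#include <linux/uaccess.h>

static bool write_page(struct io_mapping *mapping, unsigned long base,
		       unsigned long offset, const void __user *user_data,
		       unsigned long length)
{
	void __iomem *vaddr = io_mapping_map_local_wc(mapping, base);
	bool fail = false;

	if (copy_from_user((void __force *)vaddr + offset, user_data, length))
		fail = true;
	io_mapping_unmap_local(vaddr);
	return fail;
}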


x86/ioapic: Cleanup the timer_works() irqflags mess

2020-12-10 Thread Thomas Gleixner
Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:

local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);

which triggered a new consistency check he's working on, required for
replacing the POPF based restore with a conditional STI.

That code is a historical mess and none of this is needed. Make it
straightforward and use local_irq_disable()/enable(), as that's all that is
required. It is invoked from interrupt-enabled code nowadays.

Reported-by: Mark Rutland 
Signed-off-by: Thomas Gleixner 
Tested-by: Mark Rutland 
---
 arch/x86/kernel/apic/io_apic.c |   22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1618,21 +1618,16 @@ static void __init delay_without_tsc(voi
 static int __init timer_irq_works(void)
 {
unsigned long t1 = jiffies;
-   unsigned long flags;
 
if (no_timer_check)
return 1;
 
-   local_save_flags(flags);
local_irq_enable();
-
if (boot_cpu_has(X86_FEATURE_TSC))
delay_with_tsc();
else
delay_without_tsc();
 
-   local_irq_restore(flags);
-
/*
 * Expect a few ticks at least, to be sure some possible
 * glue logic does not lock up after one or two first
@@ -1641,10 +1636,10 @@ static int __init timer_irq_works(void)
 * least one tick may be lost due to delays.
 */
 
-   /* jiffies wrap? */
-   if (time_after(jiffies, t1 + 4))
-   return 1;
-   return 0;
+   local_irq_disable();
+
+   /* Did jiffies advance? */
+   return time_after(jiffies, t1 + 4);
 }
 
 /*
@@ -2117,13 +2112,12 @@ static inline void __init check_timer(vo
struct irq_cfg *cfg = irqd_cfg(irq_data);
int node = cpu_to_node(0);
int apic1, pin1, apic2, pin2;
-   unsigned long flags;
int no_pin1 = 0;
 
if (!global_clock_event)
return;
 
-   local_irq_save(flags);
+   local_irq_disable();
 
/*
 * get/set the timer IRQ vector:
@@ -2191,7 +2185,6 @@ static inline void __init check_timer(vo
goto out;
}
	panic_if_irq_remap("timer doesn't work through Interrupt-remapped IO-APIC");
-   local_irq_disable();
clear_IO_APIC_pin(apic1, pin1);
if (!no_pin1)
apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: "
@@ -2215,7 +2208,6 @@ static inline void __init check_timer(vo
/*
 * Cleanup, just in case ...
 */
-   local_irq_disable();
legacy_pic->mask(0);
clear_IO_APIC_pin(apic2, pin2);
apic_printk(APIC_QUIET, KERN_INFO "... failed.\n");
@@ -2232,7 +2224,6 @@ static inline void __init check_timer(vo
apic_printk(APIC_QUIET, KERN_INFO ". works.\n");
goto out;
}
-   local_irq_disable();
legacy_pic->mask(0);
apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_FIXED | cfg->vector);
apic_printk(APIC_QUIET, KERN_INFO ". failed.\n");
@@ -2251,7 +2242,6 @@ static inline void __init check_timer(vo
apic_printk(APIC_QUIET, KERN_INFO ". works.\n");
goto out;
}
-   local_irq_disable();
apic_printk(APIC_QUIET, KERN_INFO ". failed :(.\n");
if (apic_is_x2apic_enabled())
apic_printk(APIC_QUIET, KERN_INFO
@@ -2260,7 +2250,7 @@ static inline void __init check_timer(vo
panic("IO-APIC + timer doesn't work!  Boot with apic=debug and send a "
"report.  Then try booting with the 'noapic' option.\n");
 out:
-   local_irq_restore(flags);
+   local_irq_enable();
 }
 
 /*


Re: [PATCH v2 03/12] x86/pv: switch SWAPGS to ALTERNATIVE

2020-12-09 Thread Thomas Gleixner
On Fri, Nov 20 2020 at 12:46, Juergen Gross wrote:
> SWAPGS is used only for interrupts coming from user mode or for
> returning to user mode. So there is no reason to use the PARAVIRT
> framework, as it can easily be replaced by an ALTERNATIVE depending
> on X86_FEATURE_XENPV.
>
> There are several instances using the PV-aware SWAPGS macro in paths
> which are never executed in a Xen PV guest. Replace those with the
> plain swapgs instruction. For SWAPGS_UNSAFE_STACK the same applies.
>
> Signed-off-by: Juergen Gross 
> Acked-by: Andy Lutomirski 
> Acked-by: Peter Zijlstra (Intel) 

Reviewed-by: Thomas Gleixner 


Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-12-09 Thread Thomas Gleixner
On Wed, Dec 09 2020 at 18:15, Mark Rutland wrote:
> In arch/x86/kernel/apic/io_apic.c's timer_irq_works() we do:
>
>   local_irq_save(flags);
>   local_irq_enable();
>
>   [ trigger an IRQ here ]
>
>   local_irq_restore(flags);
>
> ... and in check_timer() we call that a number of times after either a
> local_irq_save() or local_irq_disable(), eventually trailing with a
> local_irq_disable() that will balance things up before calling
> local_irq_restore().
>
> I guess that timer_irq_works() should instead do:
>
>   local_irq_save(flags);
>   local_irq_enable();
>   ...
>   local_irq_disable();
>   local_irq_restore(flags);
>
> ... assuming we consider that legitimate?

Nah. That's old and insane gunk.

Thanks,

tglx
---
 arch/x86/kernel/apic/io_apic.c |   22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1618,21 +1618,16 @@ static void __init delay_without_tsc(voi
 static int __init timer_irq_works(void)
 {
unsigned long t1 = jiffies;
-   unsigned long flags;
 
if (no_timer_check)
return 1;
 
-   local_save_flags(flags);
local_irq_enable();
-
if (boot_cpu_has(X86_FEATURE_TSC))
delay_with_tsc();
else
delay_without_tsc();
 
-   local_irq_restore(flags);
-
/*
 * Expect a few ticks at least, to be sure some possible
 * glue logic does not lock up after one or two first
@@ -1641,10 +1636,10 @@ static int __init timer_irq_works(void)
 * least one tick may be lost due to delays.
 */
 
-   /* jiffies wrap? */
-   if (time_after(jiffies, t1 + 4))
-   return 1;
-   return 0;
+   local_irq_disable();
+
+   /* Did jiffies advance? */
+   return time_after(jiffies, t1 + 4);
 }
 
 /*
@@ -2117,13 +2112,12 @@ static inline void __init check_timer(vo
struct irq_cfg *cfg = irqd_cfg(irq_data);
int node = cpu_to_node(0);
int apic1, pin1, apic2, pin2;
-   unsigned long flags;
int no_pin1 = 0;
 
if (!global_clock_event)
return;
 
-   local_irq_save(flags);
+   local_irq_disable();
 
/*
 * get/set the timer IRQ vector:
@@ -2191,7 +2185,6 @@ static inline void __init check_timer(vo
goto out;
}
	panic_if_irq_remap("timer doesn't work through Interrupt-remapped IO-APIC");
-   local_irq_disable();
clear_IO_APIC_pin(apic1, pin1);
if (!no_pin1)
apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: "
@@ -2215,7 +2208,6 @@ static inline void __init check_timer(vo
/*
 * Cleanup, just in case ...
 */
-   local_irq_disable();
legacy_pic->mask(0);
clear_IO_APIC_pin(apic2, pin2);
apic_printk(APIC_QUIET, KERN_INFO "... failed.\n");
@@ -2232,7 +2224,6 @@ static inline void __init check_timer(vo
apic_printk(APIC_QUIET, KERN_INFO ". works.\n");
goto out;
}
-   local_irq_disable();
legacy_pic->mask(0);
apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_FIXED | cfg->vector);
apic_printk(APIC_QUIET, KERN_INFO ". failed.\n");
@@ -2251,7 +2242,6 @@ static inline void __init check_timer(vo
apic_printk(APIC_QUIET, KERN_INFO ". works.\n");
goto out;
}
-   local_irq_disable();
apic_printk(APIC_QUIET, KERN_INFO ". failed :(.\n");
if (apic_is_x2apic_enabled())
apic_printk(APIC_QUIET, KERN_INFO
@@ -2260,7 +2250,7 @@ static inline void __init check_timer(vo
panic("IO-APIC + timer doesn't work!  Boot with apic=debug and send a "
"report.  Then try booting with the 'noapic' option.\n");
 out:
-   local_irq_restore(flags);
+   local_irq_enable();
 }
 
 /*


Re: [patch V3 10/37] ARM: highmem: Switch to generic kmap atomic

2020-11-12 Thread Thomas Gleixner
Marek,

On Thu, Nov 12 2020 at 09:10, Marek Szyprowski wrote:
> On 03.11.2020 10:27, Thomas Gleixner wrote:
>
> I can do more tests to help fixing this issue. Just let me know what to do.

Just sent out the fix before I saw your report.

 https://lore.kernel.org/r/87y2j6n8mj@nanos.tec.linutronix.de

Thanks,

tglx


Re: [patch V3 22/37] highmem: High implementation details and document API

2020-11-03 Thread Thomas Gleixner
On Tue, Nov 03 2020 at 09:48, Linus Torvalds wrote:
> I have no complaints about the patch, but it strikes me that if people
> want to actually have much better debug coverage, this is where it
> should be (I like the "every other address" thing too, don't get me
> wrong).
>
> In particular, instead of these PageHighMem(page) tests, I think
> something like this would be better:
>
>#ifdef CONFIG_DEBUG_HIGHMEM
>  #define page_use_kmap(page) ((page),1)
>#else
>  #define page_use_kmap(page) PageHighMem(page)
>#endif
>
> adn then replace those "if (!PageHighMem(page))" tests with "if
> (!page_use_kmap())" instead.
>
> IOW, in debug mode, it would _always_ remap the page, whether it's
> highmem or not. That would really stress the highmem code and find any
> fragilities.

Yes, that makes a lot of sense. We just have to avoid that for the
architectures with aliasing issues.

> Anyway, this is all sepatrate from the series, which still looks fine
> to me. Just a reaction to seeing the patch, and Thomas' earlier
> mention that the highmem debugging doesn't actually do much.

Right, forcing it for both kmap and kmap_local is straightforward. I'll
cook a patch on top for that.

Thanks,

tglx


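A sketch of the debug knob discussed above - not merged code, and with a
(void) cast added to avoid set-but-unused warnings:

#ifdef CONFIG_DEBUG_HIGHMEM
/* Pretend every page is highmem so the kmap paths are always exercised */
# define page_use_kmap(page)	((void)(page), 1)
#else
# define page_use_kmap(page)	PageHighMem(page)
#endif

Callers would then test "if (!page_use_kmap(page))" instead of
"if (!PageHighMem(page))", so debug builds always take the mapping path.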


[patch V4 24/37] sched: highmem: Store local kmaps in task struct

2020-11-03 Thread Thomas Gleixner
Instead of storing the map per CPU, provide and use per-task storage. That
prepares for local kmaps which are preemptible.

The context switch code is preparatory and not yet in use because
kmap_atomic() runs with preemption disabled. Will be made usable in the
next step.

The context switch logic is safe even when an interrupt happens after
clearing or before restoring the kmaps. The kmap index in task struct is
not modified so any nesting kmap in an interrupt will use unused indices
and on return the counter is the same as before.

Also add an assert into the return to user space code. Going back to user
space with an active kmap local is a no-no.

Signed-off-by: Thomas Gleixner 
---
V4: Use the version which actually compiles and works
V3: Handle the debug case correctly
---
 include/linux/highmem-internal.h |   10 +++
 include/linux/sched.h|9 +++
 kernel/entry/common.c|2 
 kernel/fork.c|1 
 kernel/sched/core.c  |   18 +++
 mm/highmem.c |   99 +++
 6 files changed, 129 insertions(+), 10 deletions(-)

--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -9,6 +9,16 @@
 void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
 void *__kmap_local_page_prot(struct page *page, pgprot_t prot);
 void kunmap_local_indexed(void *vaddr);
+void kmap_local_fork(struct task_struct *tsk);
+void __kmap_local_sched_out(void);
+void __kmap_local_sched_in(void);
+static inline void kmap_assert_nomap(void)
+{
+   DEBUG_LOCKS_WARN_ON(current->kmap_ctrl.idx);
+}
+#else
+static inline void kmap_local_fork(struct task_struct *tsk) { }
+static inline void kmap_assert_nomap(void) { }
 #endif
 
 #ifdef CONFIG_HIGHMEM
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -629,6 +630,13 @@ struct wake_q_node {
struct wake_q_node *next;
 };
 
+struct kmap_ctrl {
+#ifdef CONFIG_KMAP_LOCAL
+   int idx;
+   pte_t   pteval[KM_MAX_IDX];
+#endif
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1294,6 +1302,7 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+   struct kmap_ctrlkmap_ctrl;
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long   task_state_change;
 #endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -2,6 +2,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -194,6 +195,7 @@ static void exit_to_user_mode_prepare(st
 
/* Ensure that the address limit is intact and no locks are held */
addr_limit_user_check();
+   kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
 }
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -930,6 +930,7 @@ static struct task_struct *dup_task_stru
account_kernel_stack(tsk, 1);
 
kcov_task_init(tsk);
+   kmap_local_fork(tsk);
 
 #ifdef CONFIG_FAULT_INJECTION
tsk->fail_nth = 0;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4053,6 +4053,22 @@ static inline void finish_lock_switch(st
 # define finish_arch_post_lock_switch()do { } while (0)
 #endif
 
+static inline void kmap_local_sched_out(void)
+{
+#ifdef CONFIG_KMAP_LOCAL
+   if (unlikely(current->kmap_ctrl.idx))
+   __kmap_local_sched_out();
+#endif
+}
+
+static inline void kmap_local_sched_in(void)
+{
+#ifdef CONFIG_KMAP_LOCAL
+   if (unlikely(current->kmap_ctrl.idx))
+   __kmap_local_sched_in();
+#endif
+}
+
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -4075,6 +4091,7 @@ prepare_task_switch(struct rq *rq, struc
perf_event_task_sched_out(prev, next);
rseq_preempt(prev);
fire_sched_out_preempt_notifiers(prev, next);
+   kmap_local_sched_out();
prepare_task(next);
prepare_arch_switch(next);
 }
@@ -4141,6 +4158,7 @@ static struct rq *finish_task_switch(str
finish_lock_switch(rq);
finish_arch_post_lock_switch();
kcov_finish_switch(current);
+   kmap_local_sched_in();
 
fire_sched_in_preempt_notifiers(current);
/*
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -365,8 +365,6 @@ EXPORT_SYMBOL(kunmap_high);
 
 #include 
 
-static DEFINE_PER_CPU(int, __kmap_local_idx);
-
 /*
  * With DEBUG_HIGHMEM the stack depth is doubled and every second
  * slot is unused which acts as a guard page
@@ -379,23 +377,21 @@ static DEFINE_PER_CPU(int, __kmap_local_
 
 static inline int kmap_local_idx_push(void)
 {
-   int idx = __this_cpu_add_return(__kmap_local_idx, KM_INCR) - 1;
-
WARN_ON_ONCE(in_irq() &

Re: [patch V3 24/37] sched: highmem: Store local kmaps in task struct

2020-11-03 Thread Thomas Gleixner
On Tue, Nov 03 2020 at 10:27, Thomas Gleixner wrote:
> +struct kmap_ctrl {
> +#ifdef CONFIG_KMAP_LOCAL
> + int idx;
> + pte_t   pteval[KM_TYPE_NR];

I'm a moron. Fixed it on the test machine ...


[patch V3 27/37] x86/crashdump/32: Simplify copy_oldmem_page()

2020-11-03 Thread Thomas Gleixner
Replace kmap_atomic_pfn() with kmap_local_pfn() which is preemptible and
can take page faults.

Remove the indirection of the dump page and the related cruft which is no
longer required.

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 arch/x86/kernel/crash_dump_32.c |   48 
 1 file changed, 10 insertions(+), 38 deletions(-)

--- a/arch/x86/kernel/crash_dump_32.c
+++ b/arch/x86/kernel/crash_dump_32.c
@@ -13,8 +13,6 @@
 
 #include 
 
-static void *kdump_buf_page;
-
 static inline bool is_crashed_pfn_valid(unsigned long pfn)
 {
 #ifndef CONFIG_X86_PAE
@@ -41,15 +39,11 @@ static inline bool is_crashed_pfn_valid(
  * @userbuf: if set, @buf is in user address space, use copy_to_user(),
  * otherwise @buf is in kernel address space, use memcpy().
  *
- * Copy a page from "oldmem". For this page, there is no pte mapped
- * in the current kernel. We stitch up a pte, similar to kmap_atomic.
- *
- * Calling copy_to_user() in atomic context is not desirable. Hence first
- * copying the data to a pre-allocated kernel page and then copying to user
- * space in non-atomic context.
+ * Copy a page from "oldmem". For this page, there might be no pte mapped
+ * in the current kernel.
  */
-ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
-   size_t csize, unsigned long offset, int userbuf)
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
+unsigned long offset, int userbuf)
 {
void  *vaddr;
 
@@ -59,38 +53,16 @@ ssize_t copy_oldmem_page(unsigned long p
if (!is_crashed_pfn_valid(pfn))
return -EFAULT;
 
-   vaddr = kmap_atomic_pfn(pfn);
+   vaddr = kmap_local_pfn(pfn);
 
if (!userbuf) {
-   memcpy(buf, (vaddr + offset), csize);
-   kunmap_atomic(vaddr);
+   memcpy(buf, vaddr + offset, csize);
} else {
-   if (!kdump_buf_page) {
-   printk(KERN_WARNING "Kdump: Kdump buffer page not"
-   " allocated\n");
-   kunmap_atomic(vaddr);
-   return -EFAULT;
-   }
-   copy_page(kdump_buf_page, vaddr);
-   kunmap_atomic(vaddr);
-   if (copy_to_user(buf, (kdump_buf_page + offset), csize))
-   return -EFAULT;
+   if (copy_to_user(buf, vaddr + offset, csize))
+   csize = -EFAULT;
}
 
-   return csize;
-}
+   kunmap_local(vaddr);
 
-static int __init kdump_buf_page_init(void)
-{
-   int ret = 0;
-
-   kdump_buf_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   if (!kdump_buf_page) {
-   printk(KERN_WARNING "Kdump: Failed to allocate kdump buffer"
-" page\n");
-   ret = -ENOMEM;
-   }
-
-   return ret;
+   return csize;
 }
-arch_initcall(kdump_buf_page_init);

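Condensed from the patch above, the resulting copy path; the local map may
take pagefaults, which is what makes the direct copy_to_user() legal:

	void *vaddr = kmap_local_pfn(pfn);

	if (!userbuf) {
		memcpy(buf, vaddr + offset, csize);
	} else if (copy_to_user(buf, vaddr + offset, csize)) {
		csize = -EFAULT;
	}
	kunmap_local(vaddr);
	return csize;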


[patch V3 19/37] mm/highmem: Remove the old kmap_atomic cruft

2020-11-03 Thread Thomas Gleixner
All users gone.

Signed-off-by: Thomas Gleixner 
---
 include/linux/highmem.h |   63 +++-
 mm/highmem.c|7 -
 2 files changed, 5 insertions(+), 65 deletions(-)

--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -86,31 +86,16 @@ static inline void kunmap(struct page *p
  * be used in IRQ contexts, so in some (very limited) cases we need
  * it.
  */
-
-#ifndef CONFIG_KMAP_LOCAL
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
-void kunmap_atomic_high(void *kvaddr);
-
 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
preempt_disable();
pagefault_disable();
-   if (!PageHighMem(page))
-   return page_address(page);
-   return kmap_atomic_high_prot(page, prot);
-}
-
-static inline void __kunmap_atomic(void *vaddr)
-{
-   kunmap_atomic_high(vaddr);
+   return __kmap_local_page_prot(page, prot);
 }
-#else /* !CONFIG_KMAP_LOCAL */
 
-static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+static inline void *kmap_atomic(struct page *page)
 {
-   preempt_disable();
-   pagefault_disable();
-   return __kmap_local_page_prot(page, prot);
+   return kmap_atomic_prot(page, kmap_prot);
 }
 
 static inline void *kmap_atomic_pfn(unsigned long pfn)
@@ -125,13 +110,6 @@ static inline void __kunmap_atomic(void
kunmap_local_indexed(addr);
 }
 
-#endif /* CONFIG_KMAP_LOCAL */
-
-static inline void *kmap_atomic(struct page *page)
-{
-   return kmap_atomic_prot(page, kmap_prot);
-}
-
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
 extern atomic_long_t _totalhigh_pages;
@@ -212,41 +190,8 @@ static inline void __kunmap_atomic(void
 
 #define kmap_flush_unused()do {} while(0)
 
-#endif /* CONFIG_HIGHMEM */
-
-#if !defined(CONFIG_KMAP_LOCAL)
-#if defined(CONFIG_HIGHMEM)
-
-DECLARE_PER_CPU(int, __kmap_atomic_idx);
-
-static inline int kmap_atomic_idx_push(void)
-{
-   int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   WARN_ON_ONCE(in_irq() && !irqs_disabled());
-   BUG_ON(idx >= KM_TYPE_NR);
-#endif
-   return idx;
-}
-
-static inline int kmap_atomic_idx(void)
-{
-   return __this_cpu_read(__kmap_atomic_idx) - 1;
-}
 
-static inline void kmap_atomic_idx_pop(void)
-{
-#ifdef CONFIG_DEBUG_HIGHMEM
-   int idx = __this_cpu_dec_return(__kmap_atomic_idx);
-
-   BUG_ON(idx < 0);
-#else
-   __this_cpu_dec(__kmap_atomic_idx);
-#endif
-}
-#endif
-#endif
+#endif /* CONFIG_HIGHMEM */
 
 /*
  * Prevent people trying to call kunmap_atomic() as if it were kunmap()
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -31,12 +31,6 @@
 #include 
 #include 
 
-#ifndef CONFIG_KMAP_LOCAL
-#ifdef CONFIG_HIGHMEM
-DEFINE_PER_CPU(int, __kmap_atomic_idx);
-#endif
-#endif
-
 /*
  * Virtual_count is not a pure "count".
  *  0 means that it is not mapped, and has not been mapped
@@ -410,6 +404,7 @@ static inline void kmap_local_idx_pop(vo
 #ifndef arch_kmap_local_post_map
 # define arch_kmap_local_post_map(vaddr, pteval)   do { } while (0)
 #endif
+
 #ifndef arch_kmap_local_pre_unmap
 # define arch_kmap_local_pre_unmap(vaddr)  do { } while (0)
 #endif

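Related detail: the header keeps a build-time type check so that
kunmap_atomic() cannot be handed the page instead of the mapped address.
A made-up caller to illustrate the correct pairing:

#include <linux/highmem.h>

static void touch_first_byte(struct page *page)
{
	char *addr = kmap_atomic(page);

	addr[0] = 0;
	kunmap_atomic(addr);	/* kunmap_atomic(page) would fail to build */
}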


[patch V3 37/37] io-mapping: Remove io_mapping_map_atomic_wc()

2020-11-03 Thread Thomas Gleixner
No more users. Get rid of it and remove the traces in documentation.

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 Documentation/driver-api/io-mapping.rst |   22 +---
 include/linux/io-mapping.h  |   42 +---
 2 files changed, 9 insertions(+), 55 deletions(-)

--- a/Documentation/driver-api/io-mapping.rst
+++ b/Documentation/driver-api/io-mapping.rst
@@ -21,19 +21,15 @@ mappable, while 'size' indicates how lar
 enable. Both are in bytes.
 
 This _wc variant provides a mapping which may only be used with
-io_mapping_map_atomic_wc(), io_mapping_map_local_wc() or
-io_mapping_map_wc().
+io_mapping_map_local_wc() or io_mapping_map_wc().
 
 With this mapping object, individual pages can be mapped either temporarily
 or long term, depending on the requirements. Of course, temporary maps are
-more efficient. They come in two flavours::
+more efficient.
 
void *io_mapping_map_local_wc(struct io_mapping *mapping,
  unsigned long offset)
 
-   void *io_mapping_map_atomic_wc(struct io_mapping *mapping,
-  unsigned long offset)
-
 'offset' is the offset within the defined mapping region.  Accessing
 addresses beyond the region specified in the creation function yields
 undefined results. Using an offset which is not page aligned yields an
@@ -50,9 +46,6 @@ io_mapping_map_local_wc() has a side eff
 migration to make the mapping code work. No caller can rely on this side
 effect.
 
-io_mapping_map_atomic_wc() has the side effect of disabling preemption and
-pagefaults. Don't use in new code. Use io_mapping_map_local_wc() instead.
-
 Nested mappings need to be undone in reverse order because the mapping
 code uses a stack for keeping track of them::
 
@@ -65,11 +58,10 @@ Nested mappings need to be undone in rev
 The mappings are released with::
 
void io_mapping_unmap_local(void *vaddr)
-   void io_mapping_unmap_atomic(void *vaddr)
 
-'vaddr' must be the value returned by the last io_mapping_map_local_wc() or
-io_mapping_map_atomic_wc() call. This unmaps the specified mapping and
-undoes the side effects of the mapping functions.
+'vaddr' must be the value returned by the last io_mapping_map_local_wc()
+call. This unmaps the specified mapping and undoes any side effects of
+the mapping function.
 
 If you need to sleep while holding a mapping, you can use the regular
 variant, although this may be significantly slower::
@@ -77,8 +69,8 @@ If you need to sleep while holding a map
void *io_mapping_map_wc(struct io_mapping *mapping,
unsigned long offset)
 
-This works like io_mapping_map_atomic/local_wc() except it has no side
-effects and the pointer is globaly visible.
+This works like io_mapping_map_local_wc() except it has no side effects and
+the pointer is globally visible.
 
 The mappings are released with::
 
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -60,28 +60,7 @@ io_mapping_fini(struct io_mapping *mappi
iomap_free(mapping->base, mapping->size);
 }
 
-/* Atomic map/unmap */
-static inline void __iomem *
-io_mapping_map_atomic_wc(struct io_mapping *mapping,
-unsigned long offset)
-{
-   resource_size_t phys_addr;
-
-   BUG_ON(offset >= mapping->size);
-   phys_addr = mapping->base + offset;
-   preempt_disable();
-   pagefault_disable();
-   return __iomap_local_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
-}
-
-static inline void
-io_mapping_unmap_atomic(void __iomem *vaddr)
-{
-   kunmap_local_indexed((void __force *)vaddr);
-   pagefault_enable();
-   preempt_enable();
-}
-
+/* Temporary mappings which are only valid in the current context */
 static inline void __iomem *
 io_mapping_map_local_wc(struct io_mapping *mapping, unsigned long offset)
 {
@@ -163,24 +142,7 @@ io_mapping_unmap(void __iomem *vaddr)
 {
 }
 
-/* Atomic map/unmap */
-static inline void __iomem *
-io_mapping_map_atomic_wc(struct io_mapping *mapping,
-unsigned long offset)
-{
-   preempt_disable();
-   pagefault_disable();
-   return io_mapping_map_wc(mapping, offset, PAGE_SIZE);
-}
-
-static inline void
-io_mapping_unmap_atomic(void __iomem *vaddr)
-{
-   io_mapping_unmap(vaddr);
-   pagefault_enable();
-   preempt_enable();
-}
-
+/* Temporary mappings which are only valid in the current context */
 static inline void __iomem *
 io_mapping_map_local_wc(struct io_mapping *mapping, unsigned long offset)
 {



[patch V3 20/37] io-mapping: Cleanup atomic iomap

2020-11-03 Thread Thomas Gleixner
Switch the atomic iomap implementation over to kmap_local and stick the
preempt/pagefault mechanics into the generic code similar to the
kmap_atomic variants.

Rename the x86 map function in preparation for a non-atomic variant.

Signed-off-by: Thomas Gleixner 
---
V2: New patch to make review easier
---
 arch/x86/include/asm/iomap.h |9 +
 arch/x86/mm/iomap_32.c   |6 ++
 include/linux/io-mapping.h   |8 ++--
 3 files changed, 9 insertions(+), 14 deletions(-)

--- a/arch/x86/include/asm/iomap.h
+++ b/arch/x86/include/asm/iomap.h
@@ -13,14 +13,7 @@
 #include 
 #include 
 
-void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot);
-
-static inline void iounmap_atomic(void __iomem *vaddr)
-{
-   kunmap_local_indexed((void __force *)vaddr);
-   pagefault_enable();
-   preempt_enable();
-}
+void __iomem *__iomap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
 
 int iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
 
--- a/arch/x86/mm/iomap_32.c
+++ b/arch/x86/mm/iomap_32.c
@@ -44,7 +44,7 @@ void iomap_free(resource_size_t base, un
 }
 EXPORT_SYMBOL_GPL(iomap_free);
 
-void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot)
+void __iomem *__iomap_local_pfn_prot(unsigned long pfn, pgprot_t prot)
 {
/*
 * For non-PAT systems, translate non-WB request to UC- just in
@@ -60,8 +60,6 @@ void __iomem *iomap_atomic_pfn_prot(unsi
/* Filter out unsupported __PAGE_KERNEL* bits: */
pgprot_val(prot) &= __default_kernel_pte_mask;
 
-   preempt_disable();
-   pagefault_disable();
return (void __force __iomem *)__kmap_local_pfn_prot(pfn, prot);
 }
-EXPORT_SYMBOL_GPL(iomap_atomic_pfn_prot);
+EXPORT_SYMBOL_GPL(__iomap_local_pfn_prot);
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -69,13 +69,17 @@ io_mapping_map_atomic_wc(struct io_mappi
 
BUG_ON(offset >= mapping->size);
phys_addr = mapping->base + offset;
-   return iomap_atomic_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
+   preempt_disable();
+   pagefault_disable();
+   return __iomap_local_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
 }
 
 static inline void
 io_mapping_unmap_atomic(void __iomem *vaddr)
 {
-   iounmap_atomic(vaddr);
+   kunmap_local_indexed((void __force *)vaddr);
+   pagefault_enable();
+   preempt_enable();
 }
 
 static inline void __iomem *

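For reference, the non-atomic variant this rename prepares for ends up
looking roughly like this later in the series (sketch, bounds check
omitted; compare the final io-mapping.h hunks elsewhere in this thread):

static inline void __iomem *
io_mapping_map_local_wc(struct io_mapping *mapping, unsigned long offset)
{
	resource_size_t phys_addr = mapping->base + offset;

	return __iomap_local_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
}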


[patch V3 31/37] drm/ttm: Replace kmap_atomic() usage

2020-11-03 Thread Thomas Gleixner
There is no reason to disable pagefaults and preemption as a side effect of
kmap_atomic_prot().

Use kmap_local_page_prot() instead and document the reasoning for the
mapping usage with the given pgprot.

Remove the NULL pointer check for the map. These functions return a valid
address for valid pages and the return was bogus anyway as it would have
left preemption and pagefaults disabled.

Signed-off-by: Thomas Gleixner 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: dri-de...@lists.freedesktop.org
---
V3: New patch
---
 drivers/gpu/drm/ttm/ttm_bo_util.c |   20 
 1 file changed, 12 insertions(+), 8 deletions(-)

--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -181,13 +181,15 @@ static int ttm_copy_io_ttm_page(struct t
return -ENOMEM;
 
src = (void *)((unsigned long)src + (page << PAGE_SHIFT));
-   dst = kmap_atomic_prot(d, prot);
-   if (!dst)
-   return -ENOMEM;
+   /*
+* Ensure that a highmem page is mapped with the correct
+* pgprot. For non highmem the mapping is already there.
+*/
+   dst = kmap_local_page_prot(d, prot);
 
memcpy_fromio(dst, src, PAGE_SIZE);
 
-   kunmap_atomic(dst);
+   kunmap_local(dst);
 
return 0;
 }
@@ -203,13 +205,15 @@ static int ttm_copy_ttm_io_page(struct t
return -ENOMEM;
 
dst = (void *)((unsigned long)dst + (page << PAGE_SHIFT));
-   src = kmap_atomic_prot(s, prot);
-   if (!src)
-   return -ENOMEM;
+   /*
+* Ensure that a highmem page is mapped with the correct
+* pgprot. For non highmem the mapping is already there.
+*/
+   src = kmap_local_page_prot(s, prot);
 
memcpy_toio(dst, src, PAGE_SIZE);
 
-   kunmap_atomic(src);
+   kunmap_local(src);
 
return 0;
 }



[patch V3 09/37] arc/mm/highmem: Use generic kmap atomic implementation

2020-11-03 Thread Thomas Gleixner
Adopt the map ordering to match the other architectures and the generic
code. Also make the maximum entries limited and not dependent on the number
of CPUs. The original implementation did the following calculation:

   nr_slots = mapsize >> PAGE_SHIFT;

This results in either 512 or 1024 total slots depending on
configuration. The total slots have to be divided by the number of CPUs to
get the number of slots per CPU (former KM_TYPE_NR). ARC supports up to 4k
CPUs, so this just falls apart in random ways depending on the number of
CPUs and the actual kmap (atomic) nesting. The comment in highmem.c:

 * - fixmap anyhow needs a limited number of mappings. So 2M kvaddr == 256 PTE
 *   slots across NR_CPUS would be more than sufficient (generic code defines
 *   KM_TYPE_NR as 20).

is just wrong. KM_TYPE_NR (now KM_MAX_IDX) is the number of slots per CPU
because kmap_local/atomic() needs to support nested mappings (thread,
softirq, interrupt). While KM_MAX_IDX might be overestimated, the above
reasoning is just wrong and clearly the highmem code was never tested with
any system with more than a few CPUs.

Use the default number of slots and fail the build when it does not
fit. Randomly failing at runtime is not a really good option.

Signed-off-by: Thomas Gleixner 
Cc: Vineet Gupta 
Cc: linux-snps-...@lists.infradead.org
---
V3: Make it actually more correct.
---
 arch/arc/Kconfig  |1 
 arch/arc/include/asm/highmem.h|   26 ++
 arch/arc/include/asm/kmap_types.h |   14 -
 arch/arc/mm/highmem.c |   54 +++---
 4 files changed, 26 insertions(+), 69 deletions(-)

--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -507,6 +507,7 @@ config LINUX_RAM_BASE
 config HIGHMEM
bool "High Memory Support"
select ARCH_DISCONTIGMEM_ENABLE
+   select KMAP_LOCAL
help
  With ARC 2G:2G address split, only upper 2G is directly addressable by
  kernel. Enable this to potentially allow access to rest of 2G and PAE
--- a/arch/arc/include/asm/highmem.h
+++ b/arch/arc/include/asm/highmem.h
@@ -9,17 +9,29 @@
 #ifdef CONFIG_HIGHMEM
 
 #include 
-#include 
+#include 
+
+#define FIXMAP_SIZEPGDIR_SIZE
+#define PKMAP_SIZE PGDIR_SIZE
 
 /* start after vmalloc area */
 #define FIXMAP_BASE(PAGE_OFFSET - FIXMAP_SIZE - PKMAP_SIZE)
-#define FIXMAP_SIZEPGDIR_SIZE  /* only 1 PGD worth */
-#define KM_TYPE_NR ((FIXMAP_SIZE >> PAGE_SHIFT)/NR_CPUS)
-#define FIXMAP_ADDR(nr)(FIXMAP_BASE + ((nr) << PAGE_SHIFT))
+
+#define FIX_KMAP_SLOTS (KM_MAX_IDX * NR_CPUS)
+#define FIX_KMAP_BEGIN (0UL)
+#define FIX_KMAP_END   ((FIX_KMAP_BEGIN + FIX_KMAP_SLOTS) - 1)
+
+#define FIXADDR_TOP(FIXMAP_BASE + (FIX_KMAP_END << PAGE_SHIFT))
+
+/*
+ * This should be converted to the asm-generic version, but of course this
+ * is needlessly different from all other architectures. Sigh - tglx
+ */
+#define __fix_to_virt(x)   (FIXADDR_TOP - ((x) << PAGE_SHIFT))
+#define __virt_to_fix(x)	(((FIXADDR_TOP - ((x) & PAGE_MASK))) >> PAGE_SHIFT)
 
 /* start after fixmap area */
 #define PKMAP_BASE (FIXMAP_BASE + FIXMAP_SIZE)
-#define PKMAP_SIZE PGDIR_SIZE
 #define LAST_PKMAP (PKMAP_SIZE >> PAGE_SHIFT)
 #define LAST_PKMAP_MASK(LAST_PKMAP - 1)
 #define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
@@ -29,11 +41,13 @@
 
 extern void kmap_init(void);
 
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE)
+
 static inline void flush_cache_kmaps(void)
 {
flush_cache_all();
 }
-
 #endif
 
 #endif
--- a/arch/arc/include/asm/kmap_types.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2015 Synopsys, Inc. (www.synopsys.com)
- */
-
-#ifndef _ASM_KMAP_TYPES_H
-#define _ASM_KMAP_TYPES_H
-
-/*
- * We primarily need to define KM_TYPE_NR here but that in turn
- * is a function of PGDIR_SIZE etc.
- * To avoid circular deps issue, put everything in asm/highmem.h
- */
-#endif
--- a/arch/arc/mm/highmem.c
+++ b/arch/arc/mm/highmem.c
@@ -36,9 +36,8 @@
 *   This means each only has 1 PGDIR_SIZE worth of kvaddr mappings, which means
 *   2M of kvaddr space for typical config (8K page and 11:8:13 traversal split)
  *
- * - fixmap anyhow needs a limited number of mappings. So 2M kvaddr == 256 PTE
- *   slots across NR_CPUS would be more than sufficient (generic code defines
- *   KM_TYPE_NR as 20).
+ * - The fixed KMAP slots for kmap_local/atomic() require KM_MAX_IDX slots per
+ *   CPU. So the number of CPUs sharing a single PTE page is limited.
  *
  * - pkmap being preemptible, in theory could do with more than 256 concurrent
  *   mappings. However, generic pkmap code: map_new_virtu

[patch V3 13/37] mips/mm/highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Thomas Bogendoerfer 
Cc: linux-m...@vger.kernel.org
---
V3: Remove the kmap types cruft
---
 arch/mips/Kconfig  |1 
 arch/mips/include/asm/fixmap.h |4 -
 arch/mips/include/asm/highmem.h|6 +-
 arch/mips/include/asm/kmap_types.h |   13 --
 arch/mips/mm/highmem.c |   77 -
 arch/mips/mm/init.c|4 -
 6 files changed, 6 insertions(+), 99 deletions(-)

--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2719,6 +2719,7 @@ config WAR_MIPS34K_MISSED_ITLB
 config HIGHMEM
bool "High Memory Support"
	depends on 32BIT && CPU_SUPPORTS_HIGHMEM && SYS_SUPPORTS_HIGHMEM && !CPU_MIPS32_3_5_EVA
+   select KMAP_LOCAL
 
 config CPU_SUPPORTS_HIGHMEM
bool
--- a/arch/mips/include/asm/fixmap.h
+++ b/arch/mips/include/asm/fixmap.h
@@ -17,7 +17,7 @@
 #include 
 #ifdef CONFIG_HIGHMEM
 #include 
-#include 
+#include 
 #endif
 
 /*
@@ -52,7 +52,7 @@ enum fixed_addresses {
 #ifdef CONFIG_HIGHMEM
/* reserved pte's for temporary kernel mappings */
FIX_KMAP_BEGIN = FIX_CMAP_END + 1,
-   FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
+   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_MAX_IDX * NR_CPUS) - 1,
 #endif
__end_of_fixed_addresses
 };
--- a/arch/mips/include/asm/highmem.h
+++ b/arch/mips/include/asm/highmem.h
@@ -24,7 +24,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /* declarations for highmem.c */
 extern unsigned long highstart_pfn, highend_pfn;
@@ -48,11 +48,11 @@ extern pte_t *pkmap_page_table;
 
 #define ARCH_HAS_KMAP_FLUSH_TLB
 extern void kmap_flush_tlb(unsigned long addr);
-extern void *kmap_atomic_pfn(unsigned long pfn);
 
 #define flush_cache_kmaps()BUG_ON(cpu_has_dc_aliases)
 
-extern void kmap_init(void);
+#define arch_kmap_local_post_map(vaddr, pteval)	local_flush_tlb_one(vaddr)
+#define arch_kmap_local_post_unmap(vaddr)  local_flush_tlb_one(vaddr)
 
 #endif /* __KERNEL__ */
 
--- a/arch/mips/include/asm/kmap_types.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_KMAP_TYPES_H
-#define _ASM_KMAP_TYPES_H
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-#define __WITH_KM_FENCE
-#endif
-
-#include 
-
-#undef __WITH_KM_FENCE
-
-#endif
--- a/arch/mips/mm/highmem.c
+++ b/arch/mips/mm/highmem.c
@@ -8,8 +8,6 @@
 #include 
 #include 
 
-static pte_t *kmap_pte;
-
 unsigned long highstart_pfn, highend_pfn;
 
 void kmap_flush_tlb(unsigned long addr)
@@ -17,78 +15,3 @@ void kmap_flush_tlb(unsigned long addr)
flush_tlb_one(addr);
 }
 EXPORT_SYMBOL(kmap_flush_tlb);
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte - idx)));
-#endif
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-   local_flush_tlb_one((unsigned long)vaddr);
-
-   return (void*) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int type __maybe_unused;
-
-   if (vaddr < FIXADDR_START)
-   return;
-
-   type = kmap_atomic_idx();
-#ifdef CONFIG_DEBUG_HIGHMEM
-   {
-   int idx = type + KM_TYPE_NR * smp_processor_id();
-
-   BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-
-   /*
-* force other mappings to Oops if they'll try to access
-* this pte without first remap it
-*/
-   pte_clear(&init_mm, vaddr, kmap_pte-idx);
-   local_flush_tlb_one(vaddr);
-   }
-#endif
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
-
-/*
- * This is the same as kmap_atomic() but can map memory that doesn't
- * have a struct page associated with it.
- */
-void *kmap_atomic_pfn(unsigned long pfn)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   preempt_disable();
-   pagefault_disable();
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   set_pte(kmap_pte-idx, pfn_pte(pfn, PAGE_KERNEL));
-   flush_tlb_one(vaddr);
-
-   return (void*) vaddr;
-}
-
-void __init kmap_init(void)
-{
-   unsigned long kmap_vstart;
-
-   /* cache the first kmap pte */
-   kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
-   kmap_pte = virt_to_kpte(kmap_vstart);
-}
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -36,7 +36,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -402,9 +401,6 @@ void __

[patch V3 18/37] highmem: Get rid of kmap_types.h

2020-11-03 Thread Thomas Gleixner
The header is no longer used, and on alpha, ia64, openrisc, parisc and um
it was completely unused anyway as these architectures have no highmem
support.

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 arch/alpha/include/asm/kmap_types.h  |   15 ---
 arch/ia64/include/asm/kmap_types.h   |   13 -
 arch/openrisc/mm/init.c  |1 -
 arch/openrisc/mm/ioremap.c   |1 -
 arch/parisc/include/asm/kmap_types.h |   13 -
 arch/um/include/asm/fixmap.h |1 -
 arch/um/include/asm/kmap_types.h |   13 -
 include/asm-generic/Kbuild   |1 -
 include/asm-generic/kmap_types.h |   11 ---
 include/linux/highmem.h  |2 --
 10 files changed, 71 deletions(-)

--- a/arch/alpha/include/asm/kmap_types.h
+++ /dev/null
@@ -1,15 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_KMAP_TYPES_H
-#define _ASM_KMAP_TYPES_H
-
-/* Dummy header just to define km_type. */
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-#define  __WITH_KM_FENCE
-#endif
-
-#include 
-
-#undef __WITH_KM_FENCE
-
-#endif
--- a/arch/ia64/include/asm/kmap_types.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_IA64_KMAP_TYPES_H
-#define _ASM_IA64_KMAP_TYPES_H
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-#define  __WITH_KM_FENCE
-#endif
-
-#include 
-
-#undef __WITH_KM_FENCE
-
-#endif /* _ASM_IA64_KMAP_TYPES_H */
--- a/arch/openrisc/mm/init.c
+++ b/arch/openrisc/mm/init.c
@@ -33,7 +33,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
--- a/arch/openrisc/mm/ioremap.c
+++ b/arch/openrisc/mm/ioremap.c
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
--- a/arch/parisc/include/asm/kmap_types.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_KMAP_TYPES_H
-#define _ASM_KMAP_TYPES_H
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-#define  __WITH_KM_FENCE
-#endif
-
-#include 
-
-#undef __WITH_KM_FENCE
-
-#endif
--- a/arch/um/include/asm/fixmap.h
+++ b/arch/um/include/asm/fixmap.h
@@ -3,7 +3,6 @@
 #define __UM_FIXMAP_H
 
 #include 
-#include 
 #include 
 #include 
 #include 
--- a/arch/um/include/asm/kmap_types.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* 
- * Copyright (C) 2002 Jeff Dike (jd...@karaya.com)
- */
-
-#ifndef __UM_KMAP_TYPES_H
-#define __UM_KMAP_TYPES_H
-
-/* No more #include "asm/arch/kmap_types.h" ! */
-
-#define KM_TYPE_NR 14
-
-#endif
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -30,7 +30,6 @@ mandatory-y += irq.h
 mandatory-y += irq_regs.h
 mandatory-y += irq_work.h
 mandatory-y += kdebug.h
-mandatory-y += kmap_types.h
 mandatory-y += kmap_size.h
 mandatory-y += kprobes.h
 mandatory-y += linkage.h
--- a/include/asm-generic/kmap_types.h
+++ /dev/null
@@ -1,11 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_GENERIC_KMAP_TYPES_H
-#define _ASM_GENERIC_KMAP_TYPES_H
-
-#ifdef __WITH_KM_FENCE
-# define KM_TYPE_NR 41
-#else
-# define KM_TYPE_NR 20
-#endif
-
-#endif
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -29,8 +29,6 @@ static inline void invalidate_kernel_vma
 }
 #endif
 
-#include 
-
 /*
  * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft.
  */



[patch V3 30/37] highmem: Remove kmap_atomic_pfn()

2020-11-03 Thread Thomas Gleixner
No more users.

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 include/linux/highmem-internal.h |   12 
 1 file changed, 12 deletions(-)

--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -99,13 +99,6 @@ static inline void *kmap_atomic(struct p
return kmap_atomic_prot(page, kmap_prot);
 }
 
-static inline void *kmap_atomic_pfn(unsigned long pfn)
-{
-   preempt_disable();
-   pagefault_disable();
-   return __kmap_local_pfn_prot(pfn, kmap_prot);
-}
-
 static inline void __kunmap_atomic(void *addr)
 {
kunmap_local_indexed(addr);
@@ -193,11 +186,6 @@ static inline void *kmap_atomic_prot(str
return kmap_atomic(page);
 }
 
-static inline void *kmap_atomic_pfn(unsigned long pfn)
-{
-   return kmap_atomic(pfn_to_page(pfn));
-}
-
 static inline void __kunmap_atomic(void *addr)
 {
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP



[patch V3 21/37] Documentation/io-mapping: Remove outdated blurb

2020-11-03 Thread Thomas Gleixner
The implementation details in the documentation are outdated and not really
helpful. Remove them.

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 Documentation/driver-api/io-mapping.rst |   22 --
 1 file changed, 22 deletions(-)

--- a/Documentation/driver-api/io-mapping.rst
+++ b/Documentation/driver-api/io-mapping.rst
@@ -73,25 +73,3 @@ for pages mapped with io_mapping_map_wc.
 At driver close time, the io_mapping object must be freed::
 
void io_mapping_free(struct io_mapping *mapping)
-
-Current Implementation
-==
-
-The initial implementation of these functions uses existing mapping
-mechanisms and so provides only an abstraction layer and no new
-functionality.
-
-On 64-bit processors, io_mapping_create_wc calls ioremap_wc for the whole
-range, creating a permanent kernel-visible mapping to the resource. The
-map_atomic and map functions add the requested offset to the base of the
-virtual address returned by ioremap_wc.
-
-On 32-bit processors with HIGHMEM defined, io_mapping_map_atomic_wc uses
-kmap_atomic_pfn to map the specified page in an atomic fashion;
-kmap_atomic_pfn isn't really supposed to be used with device pages, but it
-provides an efficient mapping for this usage.
-
-On 32-bit processors without HIGHMEM defined, io_mapping_map_atomic_wc and
-io_mapping_map_wc both use ioremap_wc, a terribly inefficient function which
-performs an IPI to inform all processors about the new mapping. This results
-in a significant performance penalty.



[patch V3 25/37] mm/highmem: Provide kmap_local*

2020-11-03 Thread Thomas Gleixner
Now that the kmap atomic index is stored in task struct, provide a
preemptible variant. On context switch the maps of an outgoing task are
removed and the maps of the incoming task are restored. That's obviously
slow, but highmem is slow anyway.

The kmap_local.*() functions can be invoked from both preemptible and
atomic context. kmap local sections disable migration to keep the resulting
virtual mapping address correct, but disable neither pagefaults nor
preemption.

A wholesale conversion of kmap_atomic to be fully preemptible is not
possible because some of the usage sites might rely on the preemption
disable for serialization or on the implicit pagefault disable. Needs to be
done on a case by case basis.

Signed-off-by: Thomas Gleixner 
---
V3: Move migrate disable into the actual highmem mapping code so it only
affects real highmem mappings.
   
V2: Make it more consistent and add commentary
---
 include/linux/highmem-internal.h |   48 +++
 include/linux/highmem.h  |   43 +-
 mm/highmem.c |6 
 3 files changed, 81 insertions(+), 16 deletions(-)

--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -69,6 +69,26 @@ static inline void kmap_flush_unused(voi
__kmap_flush_unused();
 }
 
+static inline void *kmap_local_page(struct page *page)
+{
+   return __kmap_local_page_prot(page, kmap_prot);
+}
+
+static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
+{
+   return __kmap_local_page_prot(page, prot);
+}
+
+static inline void *kmap_local_pfn(unsigned long pfn)
+{
+   return __kmap_local_pfn_prot(pfn, kmap_prot);
+}
+
+static inline void __kunmap_local(void *vaddr)
+{
+   kunmap_local_indexed(vaddr);
+}
+
 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
preempt_disable();
@@ -141,6 +161,28 @@ static inline void kunmap(struct page *p
 #endif
 }
 
+static inline void *kmap_local_page(struct page *page)
+{
+   return page_address(page);
+}
+
+static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
+{
+   return kmap_local_page(page);
+}
+
+static inline void *kmap_local_pfn(unsigned long pfn)
+{
+   return kmap_local_page(pfn_to_page(pfn));
+}
+
+static inline void __kunmap_local(void *addr)
+{
+#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
+   kunmap_flush_on_unmap(addr);
+#endif
+}
+
 static inline void *kmap_atomic(struct page *page)
 {
preempt_disable();
@@ -182,4 +224,10 @@ do {							\
__kunmap_atomic(__addr);\
 } while (0)
 
+#define kunmap_local(__addr)   \
+do {   \
+   BUILD_BUG_ON(__same_type((__addr), struct page *)); \
+   __kunmap_local(__addr); \
+} while (0)
+
 #endif
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -60,24 +60,22 @@ static inline struct page *kmap_to_page(
 static inline void kmap_flush_unused(void);
 
 /**
- * kmap_atomic - Atomically map a page for temporary usage
+ * kmap_local_page - Map a page for temporary usage
  * @page:  Pointer to the page to be mapped
  *
  * Returns: The virtual address of the mapping
  *
- * Side effect: On return pagefaults and preemption are disabled.
- *
  * Can be invoked from any context.
  *
  * Requires careful handling when nesting multiple mappings because the map
  * management is stack based. The unmap has to be in the reverse order of
  * the map operation:
  *
- * addr1 = kmap_atomic(page1);
- * addr2 = kmap_atomic(page2);
+ * addr1 = kmap_local_page(page1);
+ * addr2 = kmap_local_page(page2);
  * ...
- * kunmap_atomic(addr2);
- * kunmap_atomic(addr1);
+ * kunmap_local(addr2);
+ * kunmap_local(addr1);
  *
  * Unmapping addr1 before addr2 is invalid and causes malfunction.
  *
@@ -88,10 +86,26 @@ static inline void kmap_flush_unused(voi
  * virtual address of the direct mapping. Only real highmem pages are
  * temporarily mapped.
  *
- * While it is significantly faster than kmap() it comes with restrictions
- * about the pointer validity and the side effects of disabling page faults
- * and preemption. Use it only when absolutely necessary, e.g. from non
- * preemptible contexts.
+ * While it is significantly faster than kmap() for the highmem case it
+ * comes with restrictions about the pointer validity. Only use when really
+ * necessary.
+ *
+ * On HIGHMEM enabled systems mapping a highmem page has the side effect of
+ * disabling migration in order to keep the virtual address stable across
+ * preemption. No caller of kmap_local_page() can rely on this side effect.
+ */
+static inline void *kmap_local_page(struct page *page);
+
+/**
+ * kmap_atomic - Atomically map a page for temporary usage - Deprecated!
+ * @page:  Pointer to the page 

[patch V3 35/37] drm/nouveau/device: Replace io_mapping_map_atomic_wc()

2020-11-03 Thread Thomas Gleixner
Neither fbmem_peek() nor fbmem_poke() requires disabling pagefaults and
preemption as a side effect of io_mapping_map_atomic_wc().

Use io_mapping_map_local_wc() instead.

Signed-off-by: Thomas Gleixner 
Cc: Ben Skeggs 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: dri-de...@lists.freedesktop.org
Cc: nouv...@lists.freedesktop.org
---
V3: New patch
---
 drivers/gpu/drm/nouveau/nvkm/subdev/devinit/fbmem.h |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/drivers/gpu/drm/nouveau/nvkm/subdev/devinit/fbmem.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/devinit/fbmem.h
@@ -60,19 +60,19 @@ fbmem_fini(struct io_mapping *fb)
 static inline u32
 fbmem_peek(struct io_mapping *fb, u32 off)
 {
-   u8 __iomem *p = io_mapping_map_atomic_wc(fb, off & PAGE_MASK);
+   u8 __iomem *p = io_mapping_map_local_wc(fb, off & PAGE_MASK);
u32 val = ioread32(p + (off & ~PAGE_MASK));
-   io_mapping_unmap_atomic(p);
+   io_mapping_unmap_local(p);
return val;
 }
 
 static inline void
 fbmem_poke(struct io_mapping *fb, u32 off, u32 val)
 {
-   u8 __iomem *p = io_mapping_map_atomic_wc(fb, off & PAGE_MASK);
+   u8 __iomem *p = io_mapping_map_local_wc(fb, off & PAGE_MASK);
iowrite32(val, p + (off & ~PAGE_MASK));
wmb();
-   io_mapping_unmap_atomic(p);
+   io_mapping_unmap_local(p);
 }
 
 static inline bool



[patch V3 17/37] xtensa/mm/highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: linux-xte...@linux-xtensa.org
---
V3: Remove the kmap types cruft
---
 arch/xtensa/Kconfig   |1 
 arch/xtensa/include/asm/fixmap.h  |4 +--
 arch/xtensa/include/asm/highmem.h |   12 -
 arch/xtensa/mm/highmem.c  |   46 --
 4 files changed, 18 insertions(+), 45 deletions(-)

--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -666,6 +666,7 @@ endchoice
 config HIGHMEM
bool "High Memory Support"
depends on MMU
+   select KMAP_LOCAL
help
  Linux can use the full amount of RAM in the system by
  default. However, the default MMUv2 setup only maps the
--- a/arch/xtensa/include/asm/fixmap.h
+++ b/arch/xtensa/include/asm/fixmap.h
@@ -16,7 +16,7 @@
 #ifdef CONFIG_HIGHMEM
 #include 
 #include 
-#include 
+#include 
 #endif
 
 /*
@@ -39,7 +39,7 @@ enum fixed_addresses {
/* reserved pte's for temporary kernel mappings */
FIX_KMAP_BEGIN,
FIX_KMAP_END = FIX_KMAP_BEGIN +
-   (KM_TYPE_NR * NR_CPUS * DCACHE_N_COLORS) - 1,
+   (KM_MAX_IDX * NR_CPUS * DCACHE_N_COLORS) - 1,
 #endif
__end_of_fixed_addresses
 };
--- a/arch/xtensa/include/asm/highmem.h
+++ b/arch/xtensa/include/asm/highmem.h
@@ -16,9 +16,8 @@
 #include 
 #include 
 #include 
-#include 
 
-#define PKMAP_BASE ((FIXADDR_START - \
+#define PKMAP_BASE ((FIXADDR_START -   \
  (LAST_PKMAP + 1) * PAGE_SIZE) & PMD_MASK)
 #define LAST_PKMAP (PTRS_PER_PTE * DCACHE_N_COLORS)
 #define LAST_PKMAP_MASK(LAST_PKMAP - 1)
@@ -68,6 +67,15 @@ static inline void flush_cache_kmaps(voi
flush_cache_all();
 }
 
+enum fixed_addresses kmap_local_map_idx(int type, unsigned long pfn);
+#define arch_kmap_local_map_idxkmap_local_map_idx
+
+enum fixed_addresses kmap_local_unmap_idx(int type, unsigned long addr);
+#define arch_kmap_local_unmap_idx  kmap_local_unmap_idx
+
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE)
+
 void kmap_init(void);
 
 #endif
--- a/arch/xtensa/mm/highmem.c
+++ b/arch/xtensa/mm/highmem.c
@@ -12,8 +12,6 @@
 #include 
 #include 
 
-static pte_t *kmap_pte;
-
 #if DCACHE_WAY_SIZE > PAGE_SIZE
 unsigned int last_pkmap_nr_arr[DCACHE_N_COLORS];
 wait_queue_head_t pkmap_map_wait_arr[DCACHE_N_COLORS];
@@ -33,59 +31,25 @@ static inline void kmap_waitqueues_init(
 
 static inline enum fixed_addresses kmap_idx(int type, unsigned long color)
 {
-   return (type + KM_TYPE_NR * smp_processor_id()) * DCACHE_N_COLORS +
+   return (type + KM_MAX_IDX * smp_processor_id()) * DCACHE_N_COLORS +
color;
 }
 
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
+enum fixed_addresses kmap_local_map_idx(int type, unsigned long pfn)
 {
-   enum fixed_addresses idx;
-   unsigned long vaddr;
-
-   idx = kmap_idx(kmap_atomic_idx_push(),
-  DCACHE_ALIAS(page_to_phys(page)));
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte + idx)));
-#endif
-   set_pte(kmap_pte + idx, mk_pte(page, prot));
-
-   return (void *)vaddr;
+   return kmap_idx(type, DCACHE_ALIAS(pfn << PAGE_SHIFT));
 }
-EXPORT_SYMBOL(kmap_atomic_high_prot);
 
-void kunmap_atomic_high(void *kvaddr)
+enum fixed_addresses kmap_local_unmap_idx(int type, unsigned long addr)
 {
-   if (kvaddr >= (void *)FIXADDR_START &&
-   kvaddr < (void *)FIXADDR_TOP) {
-   int idx = kmap_idx(kmap_atomic_idx(),
-  DCACHE_ALIAS((unsigned long)kvaddr));
-
-   /*
-* Force other mappings to Oops if they'll try to access this
-* pte without first remap it.  Keeping stale mappings around
-* is a bad idea also, in case the page changes cacheability
-* attributes or becomes a protected page in a hypervisor.
-*/
-   pte_clear(&init_mm, kvaddr, kmap_pte + idx);
-   local_flush_tlb_kernel_range((unsigned long)kvaddr,
-(unsigned long)kvaddr + PAGE_SIZE);
-
-   kmap_atomic_idx_pop();
-   }
+   return kmap_idx(type, DCACHE_ALIAS(addr));
 }
-EXPORT_SYMBOL(kunmap_atomic_high);
 
 void __init kmap_init(void)
 {
-   unsigned long kmap_vstart;
-
/* Check if this memory layout is broken because PKMAP overlaps
 * page table.
 */
BUILD_BUG_ON(PKMAP_BASE < TLBTEMP_BASE_1 + TLBTEMP_SIZE);
-   /* cache the first kmap pte */
-   kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
-   kmap_pte = v

[patch V3 28/37] mips/crashdump: Simplify copy_oldmem_page()

2020-11-03 Thread Thomas Gleixner
Replace kmap_atomic_pfn() with kmap_local_pfn() which is preemptible and
can take page faults.

Remove the indirection of the dump page and the related cruft which is no
longer required.

Signed-off-by: Thomas Gleixner 
Cc: Thomas Bogendoerfer 
Cc: linux-m...@vger.kernel.org
---
V3: New patch
---
 arch/mips/kernel/crash_dump.c |   42 +++---
 1 file changed, 7 insertions(+), 35 deletions(-)

--- a/arch/mips/kernel/crash_dump.c
+++ b/arch/mips/kernel/crash_dump.c
@@ -5,8 +5,6 @@
 #include 
 #include 
 
-static void *kdump_buf_page;
-
 /**
  * copy_oldmem_page - copy one page from "oldmem"
  * @pfn: page frame number to be copied
@@ -17,51 +15,25 @@ static void *kdump_buf_page;
  * @userbuf: if set, @buf is in user address space, use copy_to_user(),
  * otherwise @buf is in kernel address space, use memcpy().
  *
- * Copy a page from "oldmem". For this page, there is no pte mapped
+ * Copy a page from "oldmem". For this page, there might be no pte mapped
  * in the current kernel.
- *
- * Calling copy_to_user() in atomic context is not desirable. Hence first
- * copying the data to a pre-allocated kernel page and then copying to user
- * space in non-atomic context.
  */
-ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
-size_t csize, unsigned long offset, int userbuf)
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
+unsigned long offset, int userbuf)
 {
void  *vaddr;
 
if (!csize)
return 0;
 
-   vaddr = kmap_atomic_pfn(pfn);
+   vaddr = kmap_local_pfn(pfn);
 
if (!userbuf) {
-   memcpy(buf, (vaddr + offset), csize);
-   kunmap_atomic(vaddr);
+   memcpy(buf, vaddr + offset, csize);
} else {
-   if (!kdump_buf_page) {
-   pr_warn("Kdump: Kdump buffer page not allocated\n");
-
-   return -EFAULT;
-   }
-   copy_page(kdump_buf_page, vaddr);
-   kunmap_atomic(vaddr);
-   if (copy_to_user(buf, (kdump_buf_page + offset), csize))
-   return -EFAULT;
+   if (copy_to_user(buf, vaddr + offset, csize))
+   csize = -EFAULT;
}
 
return csize;
 }
-
-static int __init kdump_buf_page_init(void)
-{
-   int ret = 0;
-
-   kdump_buf_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   if (!kdump_buf_page) {
-   pr_warn("Kdump: Failed to allocate kdump buffer page\n");
-   ret = -ENOMEM;
-   }
-
-   return ret;
-}
-arch_initcall(kdump_buf_page_init);



[patch V3 32/37] drm/vmgfx: Replace kmap_atomic()

2020-11-03 Thread Thomas Gleixner
There is no reason to disable pagefaults and preemption as a side effect of
kmap_atomic_prot().

Use kmap_local_page_prot() instead and document the reasoning for the
mapping usage with the given pgprot.

Remove the NULL pointer check for the map. These functions return a valid
address for valid pages and the return was bogus anyway as it would have
left preemption and pagefaults disabled.

Signed-off-by: Thomas Gleixner 
Cc: VMware Graphics 
Cc: Roland Scheidegger 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: dri-de...@lists.freedesktop.org
---
V3: New patch
---
 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c |   30 --
 1 file changed, 12 insertions(+), 18 deletions(-)

--- a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
@@ -375,12 +375,12 @@ static int vmw_bo_cpu_blit_line(struct v
copy_size = min_t(u32, copy_size, PAGE_SIZE - src_page_offset);
 
if (unmap_src) {
-   kunmap_atomic(d->src_addr);
+   kunmap_local(d->src_addr);
d->src_addr = NULL;
}
 
if (unmap_dst) {
-   kunmap_atomic(d->dst_addr);
+   kunmap_local(d->dst_addr);
d->dst_addr = NULL;
}
 
@@ -388,12 +388,8 @@ static int vmw_bo_cpu_blit_line(struct v
if (WARN_ON_ONCE(dst_page >= d->dst_num_pages))
return -EINVAL;
 
-   d->dst_addr =
-   kmap_atomic_prot(d->dst_pages[dst_page],
-d->dst_prot);
-   if (!d->dst_addr)
-   return -ENOMEM;
-
+		d->dst_addr = kmap_local_page_prot(d->dst_pages[dst_page],
+						   d->dst_prot);
d->mapped_dst = dst_page;
}
 
@@ -401,12 +397,8 @@ static int vmw_bo_cpu_blit_line(struct v
if (WARN_ON_ONCE(src_page >= d->src_num_pages))
return -EINVAL;
 
-   d->src_addr =
-   kmap_atomic_prot(d->src_pages[src_page],
-d->src_prot);
-   if (!d->src_addr)
-   return -ENOMEM;
-
+		d->src_addr = kmap_local_page_prot(d->src_pages[src_page],
+						   d->src_prot);
d->mapped_src = src_page;
}
diff->do_cpy(diff, d->dst_addr + dst_page_offset,
@@ -436,8 +428,10 @@ static int vmw_bo_cpu_blit_line(struct v
  *
  * Performs a CPU blit from one buffer object to another avoiding a full
  * bo vmap which may exhaust- or fragment vmalloc space.
- * On supported architectures (x86), we're using kmap_atomic which avoids
- * cross-processor TLB- and cache flushes and may, on non-HIGHMEM systems
+ *
+ * On supported architectures (x86), we're using kmap_local_prot() which
+ * avoids cross-processor TLB- and cache flushes. kmap_local_prot() will
+ * either map a highmem page with the proper pgprot on HIGHMEM=y systems or
  * reference already set-up mappings.
  *
  * Neither of the buffer objects may be placed in PCI memory
@@ -500,9 +494,9 @@ int vmw_bo_cpu_blit(struct ttm_buffer_ob
}
 out:
if (d.src_addr)
-   kunmap_atomic(d.src_addr);
+   kunmap_local(d.src_addr);
if (d.dst_addr)
-   kunmap_atomic(d.dst_addr);
+   kunmap_local(d.dst_addr);
 
return ret;
 }



[patch V3 29/37] ARM: mm: Replace kmap_atomic_pfn()

2020-11-03 Thread Thomas Gleixner
There is no requirement to disable pagefaults and preemption for these
cache management mappings.

Replace kmap_atomic_pfn() with kmap_local_pfn(). This allows removing
kmap_atomic_pfn() in the next step.
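
After the conversion the pattern is simply (a sketch; the physical address
and the cache operation are illustrative, not from the patch):

	void *vaddr = kmap_local_pfn(paddr >> PAGE_SHIFT);

	/* Pagefaults and preemption stay enabled; only migration is
	 * disabled while the mapping is held. */
	do_cache_maintenance(vaddr + (paddr & ~PAGE_MASK));	/* hypothetical */
	kunmap_local(vaddr);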

Signed-off-by: Thomas Gleixner 
Cc: Russell King 
Cc: linux-arm-ker...@lists.infradead.org
---
V3: New patch
---
 arch/arm/mm/cache-feroceon-l2.c |6 +++---
 arch/arm/mm/cache-xsc3l2.c  |4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

--- a/arch/arm/mm/cache-feroceon-l2.c
+++ b/arch/arm/mm/cache-feroceon-l2.c
@@ -49,9 +49,9 @@ static inline unsigned long l2_get_va(un
 * we simply install a virtual mapping for it only for the
 * TLB lookup to occur, hence no need to flush the untouched
 * memory mapping afterwards (note: a cache flush may happen
-* in some circumstances depending on the path taken in kunmap_atomic).
+* in some circumstances depending on the path taken in kunmap_local).
 */
-   void *vaddr = kmap_atomic_pfn(paddr >> PAGE_SHIFT);
+   void *vaddr = kmap_local_pfn(paddr >> PAGE_SHIFT);
return (unsigned long)vaddr + (paddr & ~PAGE_MASK);
 #else
return __phys_to_virt(paddr);
@@ -61,7 +61,7 @@ static inline unsigned long l2_get_va(un
 static inline void l2_put_va(unsigned long vaddr)
 {
 #ifdef CONFIG_HIGHMEM
-   kunmap_atomic((void *)vaddr);
+   kunmap_local((void *)vaddr);
 #endif
 }
 
--- a/arch/arm/mm/cache-xsc3l2.c
+++ b/arch/arm/mm/cache-xsc3l2.c
@@ -59,7 +59,7 @@ static inline void l2_unmap_va(unsigned
 {
 #ifdef CONFIG_HIGHMEM
if (va != -1)
-   kunmap_atomic((void *)va);
+   kunmap_local((void *)va);
 #endif
 }
 
@@ -75,7 +75,7 @@ static inline unsigned long l2_map_va(un
 * in place for it.
 */
l2_unmap_va(prev_va);
-   va = (unsigned long)kmap_atomic_pfn(pa >> PAGE_SHIFT);
+   va = (unsigned long)kmap_local_pfn(pa >> PAGE_SHIFT);
}
return va + (pa_offset >> (32 - PAGE_SHIFT));
 #else



[patch V3 26/37] io-mapping: Provide iomap_local variant

2020-11-03 Thread Thomas Gleixner
Similar to kmap local, provide an iomap local variant which only disables
migration, but disables neither pagefaults nor preemption.
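
For illustration, a minimal sketch of the new calls, assuming a mapping
obtained from io_mapping_create_wc() and a page-aligned offset:

	void __iomem *vaddr;

	vaddr = io_mapping_map_local_wc(mapping, offset);
	/* The task may fault or be preempted here; on 32bit HIGHMEM only
	 * migration is disabled while the map is held. */
	memset_io(vaddr, 0, PAGE_SIZE);
	io_mapping_unmap_local(vaddr);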

Signed-off-by: Thomas Gleixner 
---
V3: Restrict migrate disable to the 32bit mapping case and update documentation.

V2: Split out from the large combo patch and add the !IOMAP_ATOMIC variants
---
 Documentation/driver-api/io-mapping.rst |   76 +++-
 include/linux/io-mapping.h  |   30 +++-
 2 files changed, 74 insertions(+), 32 deletions(-)

--- a/Documentation/driver-api/io-mapping.rst
+++ b/Documentation/driver-api/io-mapping.rst
@@ -20,55 +20,71 @@ as it would consume too much of the kern
 mappable, while 'size' indicates how large a mapping region to
 enable. Both are in bytes.
 
-This _wc variant provides a mapping which may only be used
-with the io_mapping_map_atomic_wc or io_mapping_map_wc.
+This _wc variant provides a mapping which may only be used with
+io_mapping_map_atomic_wc(), io_mapping_map_local_wc() or
+io_mapping_map_wc().
+
+With this mapping object, individual pages can be mapped either temporarily
+or long term, depending on the requirements. Of course, temporary maps are
+more efficient. They come in two flavours::
 
-With this mapping object, individual pages can be mapped either atomically
-or not, depending on the necessary scheduling environment. Of course, atomic
-maps are more efficient::
+   void *io_mapping_map_local_wc(struct io_mapping *mapping,
+ unsigned long offset)
 
void *io_mapping_map_atomic_wc(struct io_mapping *mapping,
   unsigned long offset)
 
-'offset' is the offset within the defined mapping region.
-Accessing addresses beyond the region specified in the
-creation function yields undefined results. Using an offset
-which is not page aligned yields an undefined result. The
-return value points to a single page in CPU address space.
-
-This _wc variant returns a write-combining map to the
-page and may only be used with mappings created by
-io_mapping_create_wc
+'offset' is the offset within the defined mapping region.  Accessing
+addresses beyond the region specified in the creation function yields
+undefined results. Using an offset which is not page aligned yields an
+undefined result. The return value points to a single page in CPU address
+space.
 
-Note that the task may not sleep while holding this page
-mapped.
+This _wc variant returns a write-combining map to the page and may only be
+used with mappings created by io_mapping_create_wc()
 
-::
+Temporary mappings are only valid in the context of the caller. The mapping
+is not guaranteed to be globally visible.
 
-   void io_mapping_unmap_atomic(void *vaddr)
+io_mapping_map_local_wc() has a side effect on X86 32bit as it disables
+migration to make the mapping code work. No caller can rely on this side
+effect.
+
+io_mapping_map_atomic_wc() has the side effect of disabling preemption and
+pagefaults. Don't use in new code. Use io_mapping_map_local_wc() instead.
 
-'vaddr' must be the value returned by the last
-io_mapping_map_atomic_wc call. This unmaps the specified
-page and allows the task to sleep once again.
+Nested mappings need to be undone in reverse order because the mapping
+code uses a stack for keeping track of them::
 
-If you need to sleep while holding the lock, you can use the non-atomic
-variant, although they may be significantly slower.
+ addr1 = io_mapping_map_local_wc(map1, offset1);
+ addr2 = io_mapping_map_local_wc(map2, offset2);
+ ...
+ io_mapping_unmap_local(addr2);
+ io_mapping_unmap_local(addr1);
 
-::
+The mappings are released with::
+
+   void io_mapping_unmap_local(void *vaddr)
+   void io_mapping_unmap_atomic(void *vaddr)
+
+'vaddr' must be the value returned by the last io_mapping_map_local_wc() or
+io_mapping_map_atomic_wc() call. This unmaps the specified mapping and
+undoes the side effects of the mapping functions.
+
+If you need to sleep while holding a mapping, you can use the regular
+variant, although this may be significantly slower::
 
void *io_mapping_map_wc(struct io_mapping *mapping,
unsigned long offset)
 
-This works like io_mapping_map_atomic_wc except it allows
-the task to sleep while holding the page mapped.
-
+This works like io_mapping_map_atomic/local_wc() except it has no side
+effects and the pointer is globally visible.
 
-::
+The mappings are released with::
 
void io_mapping_unmap(void *vaddr)
 
-This works like io_mapping_unmap_atomic, except it is used
-for pages mapped with io_mapping_map_wc.
+Use for pages mapped with io_mapping_map_wc().
 
 At driver close time, the io_mapping object must be freed::
 
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -83,6 +83,21 @@ io_mapping_unmap_atomic(void __iomem *va
 }
 
 static inline void __iomem *
+io_ma

[patch V3 33/37] highmem: Remove kmap_atomic_prot()

2020-11-03 Thread Thomas Gleixner
No more users.

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 include/linux/highmem-internal.h |   14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -87,16 +87,11 @@ static inline void __kunmap_local(void *
kunmap_local_indexed(vaddr);
 }
 
-static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+static inline void *kmap_atomic(struct page *page)
 {
preempt_disable();
pagefault_disable();
-   return __kmap_local_page_prot(page, prot);
-}
-
-static inline void *kmap_atomic(struct page *page)
-{
-   return kmap_atomic_prot(page, kmap_prot);
+   return __kmap_local_page_prot(page, kmap_prot);
 }
 
 static inline void __kunmap_atomic(void *addr)
@@ -181,11 +176,6 @@ static inline void *kmap_atomic(struct p
return page_address(page);
 }
 
-static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
-{
-   return kmap_atomic(page);
-}
-
 static inline void __kunmap_atomic(void *addr)
 {
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP



[patch V3 34/37] drm/qxl: Replace io_mapping_map_atomic_wc()

2020-11-03 Thread Thomas Gleixner
None of these mappings requires the side effect of disabling pagefaults and
preemption.

Use io_mapping_map_local_wc() instead, rename the related functions
accordingly and clean up qxl_process_single_command() to use a plain
copy_from_user() as the local maps do not disable pagefaults.
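
Schematically, the cleanup relies on the local map leaving pagefaults
enabled. A sketch of the pattern, not the actual
qxl_process_single_command() hunk; 'cmd_bo', 'u_cmd' and 'size' are
illustrative:

	void *ptr;
	int ret = 0;

	ptr = qxl_bo_kmap_local_page(qdev, cmd_bo, 0);
	/* copy_from_user() may fault and sleep here; under the old atomic
	 * map this needed the _inatomic variants plus a fallback path. */
	if (copy_from_user(ptr, u_cmd, size))
		ret = -EFAULT;
	qxl_bo_kunmap_local_page(qdev, cmd_bo, ptr);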

Signed-off-by: Thomas Gleixner 
Cc: Dave Airlie 
Cc: Gerd Hoffmann 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: virtualization@lists.linux-foundation.org
Cc: spice-de...@lists.freedesktop.org
---
V3: New patch
---
 drivers/gpu/drm/qxl/qxl_image.c   |   18 +-
 drivers/gpu/drm/qxl/qxl_ioctl.c   |   27 +--
 drivers/gpu/drm/qxl/qxl_object.c  |   12 ++--
 drivers/gpu/drm/qxl/qxl_object.h  |4 ++--
 drivers/gpu/drm/qxl/qxl_release.c |4 ++--
 5 files changed, 32 insertions(+), 33 deletions(-)

--- a/drivers/gpu/drm/qxl/qxl_image.c
+++ b/drivers/gpu/drm/qxl/qxl_image.c
@@ -124,12 +124,12 @@ qxl_image_init_helper(struct qxl_device
  wrong (check the bitmaps are sent correctly
  first) */
 
-   ptr = qxl_bo_kmap_atomic_page(qdev, chunk_bo, 0);
+   ptr = qxl_bo_kmap_local_page(qdev, chunk_bo, 0);
chunk = ptr;
chunk->data_size = height * chunk_stride;
chunk->prev_chunk = 0;
chunk->next_chunk = 0;
-   qxl_bo_kunmap_atomic_page(qdev, chunk_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, chunk_bo, ptr);
 
{
void *k_data, *i_data;
@@ -143,7 +143,7 @@ qxl_image_init_helper(struct qxl_device
i_data = (void *)data;
 
while (remain > 0) {
-			ptr = qxl_bo_kmap_atomic_page(qdev, chunk_bo, page << PAGE_SHIFT);
+			ptr = qxl_bo_kmap_local_page(qdev, chunk_bo, page << PAGE_SHIFT);
 
if (page == 0) {
chunk = ptr;
@@ -157,7 +157,7 @@ qxl_image_init_helper(struct qxl_device
 
memcpy(k_data, i_data, size);
 
-   qxl_bo_kunmap_atomic_page(qdev, chunk_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, chunk_bo, ptr);
i_data += size;
remain -= size;
page++;
@@ -175,10 +175,10 @@ qxl_image_init_helper(struct qxl_device
				page_offset = offset_in_page(out_offset);
				size = min((int)(PAGE_SIZE - page_offset), remain);

-				ptr = qxl_bo_kmap_atomic_page(qdev, chunk_bo, page_base);
+				ptr = qxl_bo_kmap_local_page(qdev, chunk_bo, page_base);
				k_data = ptr + page_offset;
				memcpy(k_data, i_data, size);
-				qxl_bo_kunmap_atomic_page(qdev, chunk_bo, ptr);
+				qxl_bo_kunmap_local_page(qdev, chunk_bo, ptr);
remain -= size;
i_data += size;
out_offset += size;
@@ -189,7 +189,7 @@ qxl_image_init_helper(struct qxl_device
qxl_bo_kunmap(chunk_bo);
 
image_bo = dimage->bo;
-   ptr = qxl_bo_kmap_atomic_page(qdev, image_bo, 0);
+   ptr = qxl_bo_kmap_local_page(qdev, image_bo, 0);
image = ptr;
 
image->descriptor.id = 0;
@@ -212,7 +212,7 @@ qxl_image_init_helper(struct qxl_device
break;
default:
DRM_ERROR("unsupported image bit depth\n");
-   qxl_bo_kunmap_atomic_page(qdev, image_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, image_bo, ptr);
return -EINVAL;
}
image->u.bitmap.flags = QXL_BITMAP_TOP_DOWN;
@@ -222,7 +222,7 @@ qxl_image_init_helper(struct qxl_device
image->u.bitmap.palette = 0;
image->u.bitmap.data = qxl_bo_physical_address(qdev, chunk_bo, 0);
 
-   qxl_bo_kunmap_atomic_page(qdev, image_bo, ptr);
+   qxl_bo_kunmap_local_page(qdev, image_bo, ptr);
 
return 0;
 }
--- a/drivers/gpu/drm/qxl/qxl_ioctl.c
+++ b/drivers/gpu/drm/qxl/qxl_ioctl.c
@@ -89,11 +89,11 @@ apply_reloc(struct qxl_device *qdev, str
 {
void *reloc_page;
 
-	reloc_page = qxl_bo_kmap_atomic_page(qdev, info->dst_bo, info->dst_offset & PAGE_MASK);
+	reloc_page = qxl_bo_kmap_local_page(qdev, info->dst_bo, info->dst_offset & PAGE_MASK);
	*(uint64_t *)(reloc_page + (info->dst_offset & ~PAGE_MASK)) =
		qxl_bo_physical_address(qdev, info->src_bo,

[patch V3 22/37] highmem: Hide implementation details and document API

2020-11-03 Thread Thomas Gleixner
Move the gory details of kmap et al. into a private header and only document
the interfaces which are usable by drivers.

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 include/linux/highmem-internal.h |  174 +
 include/linux/highmem.h  |  270 ++-
 mm/highmem.c |   11 -
 3 files changed, 276 insertions(+), 179 deletions(-)

--- /dev/null
+++ b/include/linux/highmem-internal.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_HIGHMEM_INTERNAL_H
+#define _LINUX_HIGHMEM_INTERNAL_H
+
+/*
+ * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft.
+ */
+#ifdef CONFIG_KMAP_LOCAL
+void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
+void *__kmap_local_page_prot(struct page *page, pgprot_t prot);
+void kunmap_local_indexed(void *vaddr);
+#endif
+
+#ifdef CONFIG_HIGHMEM
+#include 
+
+#ifndef ARCH_HAS_KMAP_FLUSH_TLB
+static inline void kmap_flush_tlb(unsigned long addr) { }
+#endif
+
+#ifndef kmap_prot
+#define kmap_prot PAGE_KERNEL
+#endif
+
+void *kmap_high(struct page *page);
+void kunmap_high(struct page *page);
+void __kmap_flush_unused(void);
+struct page *__kmap_to_page(void *addr);
+
+static inline void *kmap(struct page *page)
+{
+   void *addr;
+
+   might_sleep();
+   if (!PageHighMem(page))
+   addr = page_address(page);
+   else
+   addr = kmap_high(page);
+   kmap_flush_tlb((unsigned long)addr);
+   return addr;
+}
+
+static inline void kunmap(struct page *page)
+{
+   might_sleep();
+   if (!PageHighMem(page))
+   return;
+   kunmap_high(page);
+}
+
+static inline struct page *kmap_to_page(void *addr)
+{
+   return __kmap_to_page(addr);
+}
+
+static inline void kmap_flush_unused(void)
+{
+   __kmap_flush_unused();
+}
+
+static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+{
+   preempt_disable();
+   pagefault_disable();
+   return __kmap_local_page_prot(page, prot);
+}
+
+static inline void *kmap_atomic(struct page *page)
+{
+   return kmap_atomic_prot(page, kmap_prot);
+}
+
+static inline void *kmap_atomic_pfn(unsigned long pfn)
+{
+   preempt_disable();
+   pagefault_disable();
+   return __kmap_local_pfn_prot(pfn, kmap_prot);
+}
+
+static inline void __kunmap_atomic(void *addr)
+{
+   kunmap_local_indexed(addr);
+   pagefault_enable();
+   preempt_enable();
+}
+
+unsigned int __nr_free_highpages(void);
+extern atomic_long_t _totalhigh_pages;
+
+static inline unsigned int nr_free_highpages(void)
+{
+   return __nr_free_highpages();
+}
+
+static inline unsigned long totalhigh_pages(void)
+{
+   return (unsigned long)atomic_long_read(&_totalhigh_pages);
+}
+
+static inline void totalhigh_pages_inc(void)
+{
+   atomic_long_inc(&_totalhigh_pages);
+}
+
+static inline void totalhigh_pages_add(long count)
+{
+   atomic_long_add(count, &_totalhigh_pages);
+}
+
+#else /* CONFIG_HIGHMEM */
+
+static inline struct page *kmap_to_page(void *addr)
+{
+   return virt_to_page(addr);
+}
+
+static inline void *kmap(struct page *page)
+{
+   might_sleep();
+   return page_address(page);
+}
+
+static inline void kunmap_high(struct page *page) { }
+static inline void kmap_flush_unused(void) { }
+
+static inline void kunmap(struct page *page)
+{
+#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
+   kunmap_flush_on_unmap(page_address(page));
+#endif
+}
+
+static inline void *kmap_atomic(struct page *page)
+{
+   preempt_disable();
+   pagefault_disable();
+   return page_address(page);
+}
+
+static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+{
+   return kmap_atomic(page);
+}
+
+static inline void *kmap_atomic_pfn(unsigned long pfn)
+{
+   return kmap_atomic(pfn_to_page(pfn));
+}
+
+static inline void __kunmap_atomic(void *addr)
+{
+#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
+   kunmap_flush_on_unmap(addr);
+#endif
+   pagefault_enable();
+   preempt_enable();
+}
+
+static inline unsigned int nr_free_highpages(void) { return 0; }
+static inline unsigned long totalhigh_pages(void) { return 0UL; }
+
+#endif /* CONFIG_HIGHMEM */
+
+/*
+ * Prevent people trying to call kunmap_atomic() as if it were kunmap()
+ * kunmap_atomic() should get the return value of kmap_atomic, not the page.
+ */
+#define kunmap_atomic(__addr)  \
+do {   \
+   BUILD_BUG_ON(__same_type((__addr), struct page *)); \
+   __kunmap_atomic(__addr);\
+} while (0)
+
+#endif
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -11,199 +11,125 @@
 
 #include 
 
-#ifndef ARCH_HAS_FLUSH_ANON_PAGE
-static inline void flush_anon_page(struct vm_area_struct *vma, struct page 
*page, unsigned long vmaddr)
-{
-}
-#endif
+#include "highmem-internal.h"
 
-#ifndef ARCH_HA

[patch V3 36/37] drm/i915: Replace io_mapping_map_atomic_wc()

2020-11-03 Thread Thomas Gleixner
None of these mappings requires the side effect of disabling pagefaults and
preemption.

Use io_mapping_map_local_wc() instead, and clean up gtt_user_read() and
gtt_user_write() to use a plain copy_from_user() as the local maps do not
disable pagefaults.

Signed-off-by: Thomas Gleixner 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: intel-...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
---
V3: New patch
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c |7 +---
 drivers/gpu/drm/i915/i915_gem.c|   40 -
 drivers/gpu/drm/i915/selftests/i915_gem.c  |4 +-
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c  |8 ++---
 4 files changed, 22 insertions(+), 37 deletions(-)

--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1081,7 +1081,7 @@ static void reloc_cache_reset(struct rel
struct i915_ggtt *ggtt = cache_to_ggtt(cache);
 
intel_gt_flush_ggtt_writes(ggtt->vm.gt);
-   io_mapping_unmap_atomic((void __iomem *)vaddr);
+   io_mapping_unmap_local((void __iomem *)vaddr);
 
if (drm_mm_node_allocated(&cache->node)) {
ggtt->vm.clear_range(&ggtt->vm,
@@ -1147,7 +1147,7 @@ static void *reloc_iomap(struct drm_i915
 
if (cache->vaddr) {
intel_gt_flush_ggtt_writes(ggtt->vm.gt);
-		io_mapping_unmap_atomic((void __force __iomem *) unmask_page(cache->vaddr));
+		io_mapping_unmap_local((void __force __iomem *) unmask_page(cache->vaddr));
} else {
struct i915_vma *vma;
int err;
@@ -1195,8 +1195,7 @@ static void *reloc_iomap(struct drm_i915
offset += page << PAGE_SHIFT;
}
 
-   vaddr = (void __force *)io_mapping_map_atomic_wc(&ggtt->iomap,
-offset);
+   vaddr = (void __force *)io_mapping_map_local_wc(&ggtt->iomap, offset);
cache->page = page;
cache->vaddr = (unsigned long)vaddr;
 
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -379,22 +379,15 @@ gtt_user_read(struct io_mapping *mapping
  char __user *user_data, int length)
 {
void __iomem *vaddr;
-   unsigned long unwritten;
+   bool fail = false;
 
/* We can use the cpu mem copy function because this is X86. */
-   vaddr = io_mapping_map_atomic_wc(mapping, base);
-   unwritten = __copy_to_user_inatomic(user_data,
-   (void __force *)vaddr + offset,
-   length);
-   io_mapping_unmap_atomic(vaddr);
-   if (unwritten) {
-   vaddr = io_mapping_map_wc(mapping, base, PAGE_SIZE);
-   unwritten = copy_to_user(user_data,
-(void __force *)vaddr + offset,
-length);
-   io_mapping_unmap(vaddr);
-   }
-   return unwritten;
+   vaddr = io_mapping_map_local_wc(mapping, base);
+   if (copy_to_user(user_data, (void __force *)vaddr + offset, length))
+   fail = true;
+   io_mapping_unmap_local(vaddr);
+
+   return fail;
 }
 
 static int
@@ -557,21 +550,14 @@ ggtt_write(struct io_mapping *mapping,
   char __user *user_data, int length)
 {
void __iomem *vaddr;
-   unsigned long unwritten;
+   bool fail = false;
 
/* We can use the cpu mem copy function because this is X86. */
-   vaddr = io_mapping_map_atomic_wc(mapping, base);
-	unwritten = __copy_from_user_inatomic_nocache((void __force *)vaddr + offset,
-						      user_data, length);
-   io_mapping_unmap_atomic(vaddr);
-   if (unwritten) {
-   vaddr = io_mapping_map_wc(mapping, base, PAGE_SIZE);
-   unwritten = copy_from_user((void __force *)vaddr + offset,
-  user_data, length);
-   io_mapping_unmap(vaddr);
-   }
-
-   return unwritten;
+   vaddr = io_mapping_map_local_wc(mapping, base);
+   if (copy_from_user((void __force *)vaddr + offset, user_data, length))
+   fail = true;
+   io_mapping_unmap_local(vaddr);
+   return fail;
 }
 
 /**
--- a/drivers/gpu/drm/i915/selftests/i915_gem.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem.c
@@ -57,12 +57,12 @@ static void trash_stolen(struct drm_i915
 
ggtt->vm.insert_page(&ggtt->vm, dma, slot, I915_CACHE_NONE, 0);
 
-   s = io_mapping_map_atomic_wc(&ggtt->iomap, slot);
+   s = io_mapping_map_local_wc(&ggtt->iomap, slot);
for (x = 0; x < PAGE_SIZE / sizeof(u32); x++) {
  

[patch V3 11/37] csky/mm/highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: linux-c...@vger.kernel.org
---
V3: Does not compile with gcc 10
---
 arch/csky/Kconfig   |1 
 arch/csky/include/asm/fixmap.h  |4 +-
 arch/csky/include/asm/highmem.h |6 ++-
 arch/csky/mm/highmem.c  |   75 
 4 files changed, 8 insertions(+), 78 deletions(-)

--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -286,6 +286,7 @@ config NR_CPUS
 config HIGHMEM
bool "High Memory Support"
depends on !CPU_CK610
+   select KMAP_LOCAL
default y
 
 config FORCE_MAX_ZONEORDER
--- a/arch/csky/include/asm/fixmap.h
+++ b/arch/csky/include/asm/fixmap.h
@@ -8,7 +8,7 @@
 #include 
 #ifdef CONFIG_HIGHMEM
 #include 
-#include 
+#include 
 #endif
 
 enum fixed_addresses {
@@ -17,7 +17,7 @@ enum fixed_addresses {
 #endif
 #ifdef CONFIG_HIGHMEM
FIX_KMAP_BEGIN,
-   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_TYPE_NR * NR_CPUS) - 1,
+   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_MAX_IDX * NR_CPUS) - 1,
 #endif
__end_of_fixed_addresses
 };
--- a/arch/csky/include/asm/highmem.h
+++ b/arch/csky/include/asm/highmem.h
@@ -9,7 +9,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 
 /* undef for production */
@@ -32,10 +32,12 @@ extern pte_t *pkmap_page_table;
 
 #define ARCH_HAS_KMAP_FLUSH_TLB
 extern void kmap_flush_tlb(unsigned long addr);
-extern void *kmap_atomic_pfn(unsigned long pfn);
 
 #define flush_cache_kmaps() do {} while (0)
 
+#define arch_kmap_local_post_map(vaddr, pteval)kmap_flush_tlb(vaddr)
+#define arch_kmap_local_post_unmap(vaddr)  kmap_flush_tlb(vaddr)
+
 extern void kmap_init(void);
 
 #endif /* __KERNEL__ */
--- a/arch/csky/mm/highmem.c
+++ b/arch/csky/mm/highmem.c
@@ -9,8 +9,6 @@
 #include 
 #include 
 
-static pte_t *kmap_pte;
-
 unsigned long highstart_pfn, highend_pfn;
 
 void kmap_flush_tlb(unsigned long addr)
@@ -19,67 +17,7 @@ void kmap_flush_tlb(unsigned long addr)
 }
 EXPORT_SYMBOL(kmap_flush_tlb);
 
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte - idx)));
-#endif
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-   flush_tlb_one((unsigned long)vaddr);
-
-   return (void *)vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int idx;
-
-   if (vaddr < FIXADDR_START)
-   return;
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   idx = KM_TYPE_NR*smp_processor_id() + kmap_atomic_idx();
-
-   BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-
-   pte_clear(&init_mm, vaddr, kmap_pte - idx);
-   flush_tlb_one(vaddr);
-#else
-   (void) idx; /* to kill a warning */
-#endif
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
-
-/*
- * This is the same as kmap_atomic() but can map memory that doesn't
- * have a struct page associated with it.
- */
-void *kmap_atomic_pfn(unsigned long pfn)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   pagefault_disable();
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   set_pte(kmap_pte-idx, pfn_pte(pfn, PAGE_KERNEL));
-   flush_tlb_one(vaddr);
-
-   return (void *) vaddr;
-}
-
-static void __init kmap_pages_init(void)
+void __init kmap_init(void)
 {
unsigned long vaddr;
pgd_t *pgd;
@@ -96,14 +34,3 @@ static void __init kmap_pages_init(void)
pte = pte_offset_kernel(pmd, vaddr);
pkmap_page_table = pte;
 }
-
-void __init kmap_init(void)
-{
-   unsigned long vaddr;
-
-   kmap_pages_init();
-
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN);
-
-   kmap_pte = pte_offset_kernel((pmd_t *)pgd_offset_k(vaddr), vaddr);
-}



[patch V3 14/37] nds32/mm/highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
The mapping code is odd and looks broken. See FIXME in the comment.

Also fix the harmless off-by-one in the FIX_KMAP_END define.

Signed-off-by: Thomas Gleixner 
Cc: Nick Hu 
Cc: Greentime Hu 
Cc: Vincent Chen 
---
V3: Remove the kmap types cruft
---
 arch/nds32/Kconfig.cpu   |1 
 arch/nds32/include/asm/fixmap.h  |4 +--
 arch/nds32/include/asm/highmem.h |   22 +
 arch/nds32/mm/Makefile   |1 
 arch/nds32/mm/highmem.c  |   48 ---
 5 files changed, 19 insertions(+), 57 deletions(-)

--- a/arch/nds32/Kconfig.cpu
+++ b/arch/nds32/Kconfig.cpu
@@ -157,6 +157,7 @@ config HW_SUPPORT_UNALIGNMENT_ACCESS
 config HIGHMEM
bool "High Memory Support"
depends on MMU && !CPU_CACHE_ALIASING
+   select KMAP_LOCAL
help
  The address space of Andes processors is only 4 Gigabytes large
  and it has to accommodate user address space, kernel address
--- a/arch/nds32/include/asm/fixmap.h
+++ b/arch/nds32/include/asm/fixmap.h
@@ -6,7 +6,7 @@
 
 #ifdef CONFIG_HIGHMEM
 #include 
-#include 
+#include 
 #endif
 
 enum fixed_addresses {
@@ -14,7 +14,7 @@ enum fixed_addresses {
FIX_KMAP_RESERVED,
FIX_KMAP_BEGIN,
 #ifdef CONFIG_HIGHMEM
-   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_TYPE_NR * NR_CPUS),
+   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_MAX_IDX * NR_CPUS) - 1,
 #endif
FIX_EARLYCON_MEM_BASE,
__end_of_fixed_addresses
--- a/arch/nds32/include/asm/highmem.h
+++ b/arch/nds32/include/asm/highmem.h
@@ -5,7 +5,6 @@
 #define _ASM_HIGHMEM_H
 
 #include 
-#include 
 #include 
 
 /*
@@ -45,11 +44,22 @@ extern pte_t *pkmap_page_table;
 extern void kmap_init(void);
 
 /*
- * The following functions are already defined by 
- * when CONFIG_HIGHMEM is not set.
+ * FIXME: The below looks broken vs. a kmap_atomic() in task context which
+ * is interrupted and another kmap_atomic() happens in interrupt context.
+ * But what do I know about nds32. -- tglx
  */
-#ifdef CONFIG_HIGHMEM
-extern void *kmap_atomic_pfn(unsigned long pfn);
-#endif
+#define arch_kmap_local_post_map(vaddr, pteval)\
+   do {\
+   __nds32__tlbop_inv(vaddr);  \
+   __nds32__mtsr_dsb(vaddr, NDS32_SR_TLB_VPN); \
+   __nds32__tlbop_rwr(pteval); \
+   __nds32__isb(); \
+   } while (0)
+
+#define arch_kmap_local_pre_unmap(vaddr)   \
+   do {\
+   __nds32__tlbop_inv(vaddr);  \
+   __nds32__isb(); \
+   } while (0)
 
 #endif
--- a/arch/nds32/mm/Makefile
+++ b/arch/nds32/mm/Makefile
@@ -3,7 +3,6 @@ obj-y   := extable.o tlb.o fault.o init
mm-nds32.o cacheflush.o proc.o
 
 obj-$(CONFIG_ALIGNMENT_TRAP)   += alignment.o
-obj-$(CONFIG_HIGHMEM)   += highmem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_proc.o = $(CC_FLAGS_FTRACE)
--- a/arch/nds32/mm/highmem.c
+++ /dev/null
@@ -1,48 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-// Copyright (C) 2005-2017 Andes Technology Corporation
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned int idx;
-   unsigned long vaddr, pte;
-   int type;
-   pte_t *ptep;
-
-   type = kmap_atomic_idx_push();
-
-   idx = type + KM_TYPE_NR * smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   pte = (page_to_pfn(page) << PAGE_SHIFT) | prot;
-   ptep = pte_offset_kernel(pmd_off_k(vaddr), vaddr);
-   set_pte(ptep, pte);
-
-   __nds32__tlbop_inv(vaddr);
-   __nds32__mtsr_dsb(vaddr, NDS32_SR_TLB_VPN);
-   __nds32__tlbop_rwr(pte);
-   __nds32__isb();
-   return (void *)vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   if (kvaddr >= (void *)FIXADDR_START) {
-   unsigned long vaddr = (unsigned long)kvaddr;
-   pte_t *ptep;
-   kmap_atomic_idx_pop();
-   __nds32__tlbop_inv(vaddr);
-   __nds32__isb();
-   ptep = pte_offset_kernel(pmd_off_k(vaddr), vaddr);
-   set_pte(ptep, 0);
-   }
-}
-EXPORT_SYMBOL(kunmap_atomic_high);



[patch V3 12/37] microblaze/mm/highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Michal Simek 
---
V3: Remove the kmap types cruft
---
 arch/microblaze/Kconfig   |1 
 arch/microblaze/include/asm/fixmap.h  |4 -
 arch/microblaze/include/asm/highmem.h |6 ++
 arch/microblaze/mm/Makefile   |1 
 arch/microblaze/mm/highmem.c  |   78 --
 arch/microblaze/mm/init.c |6 --
 6 files changed, 8 insertions(+), 88 deletions(-)

--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -155,6 +155,7 @@ config XILINX_UNCACHED_SHADOW
 config HIGHMEM
bool "High memory support"
depends on MMU
+   select KMAP_LOCAL
help
  The address space of Microblaze processors is only 4 Gigabytes large
  and it has to accommodate user address space, kernel address
--- a/arch/microblaze/include/asm/fixmap.h
+++ b/arch/microblaze/include/asm/fixmap.h
@@ -20,7 +20,7 @@
 #include 
 #ifdef CONFIG_HIGHMEM
 #include 
-#include 
+#include 
 #endif
 
 #define FIXADDR_TOP((unsigned long)(-PAGE_SIZE))
@@ -47,7 +47,7 @@ enum fixed_addresses {
FIX_HOLE,
 #ifdef CONFIG_HIGHMEM
FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
-   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_TYPE_NR * num_possible_cpus()) - 1,
+   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_MAX_IDX * num_possible_cpus()) - 1,
 #endif
__end_of_fixed_addresses
 };
--- a/arch/microblaze/include/asm/highmem.h
+++ b/arch/microblaze/include/asm/highmem.h
@@ -25,7 +25,6 @@
 #include 
 #include 
 
-extern pte_t *kmap_pte;
 extern pte_t *pkmap_page_table;
 
 /*
@@ -52,6 +51,11 @@ extern pte_t *pkmap_page_table;
 
 #define flush_cache_kmaps(){ flush_icache(); flush_dcache(); }
 
+#define arch_kmap_local_post_map(vaddr, pteval)\
+   local_flush_tlb_page(NULL, vaddr);
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_page(NULL, vaddr);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_HIGHMEM_H */
--- a/arch/microblaze/mm/Makefile
+++ b/arch/microblaze/mm/Makefile
@@ -6,4 +6,3 @@
 obj-y := consistent.o init.o
 
 obj-$(CONFIG_MMU) += pgtable.o mmu_context.o fault.o
-obj-$(CONFIG_HIGHMEM) += highmem.o
--- a/arch/microblaze/mm/highmem.c
+++ /dev/null
@@ -1,78 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * highmem.c: virtual kernel memory mappings for high memory
- *
- * PowerPC version, stolen from the i386 version.
- *
- * Used in CONFIG_HIGHMEM systems for memory pages which
- * are not addressable by direct kernel virtual addresses.
- *
- * Copyright (C) 1999 Gerhard Wichert, Siemens AG
- *   gerhard.wich...@pdb.siemens.de
- *
- *
- * Redesigned the x86 32-bit VM architecture to deal with
- * up to 16 Terrabyte physical memory. With current x86 CPUs
- * we now support up to 64 Gigabytes physical RAM.
- *
- * Copyright (C) 1999 Ingo Molnar 
- *
- * Reworked for PowerPC by various contributors. Moved from
- * highmem.h by Benjamin Herrenschmidt (c) 2009 IBM Corp.
- */
-
-#include 
-#include 
-
-/*
- * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap
- * gives a more generic (and caching) interface. But kmap_atomic can
- * be used in IRQ contexts, so in some (very limited) cases we need
- * it.
- */
-#include 
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte-idx)));
-#endif
-   set_pte_at(&init_mm, vaddr, kmap_pte-idx, mk_pte(page, prot));
-   local_flush_tlb_page(NULL, vaddr);
-
-   return (void *) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int type;
-   unsigned int idx;
-
-   if (vaddr < __fix_to_virt(FIX_KMAP_END))
-   return;
-
-   type = kmap_atomic_idx();
-
-   idx = type + KM_TYPE_NR * smp_processor_id();
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-#endif
-   /*
-* force other mappings to Oops if they'll try to access
-* this pte without first remap it
-*/
-   pte_clear(&init_mm, vaddr, kmap_pte-idx);
-   local_flush_tlb_page(NULL, vaddr);
-
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -49,17 +49,11 @@ unsigned long lowmem_size;
 EXPORT_SYMBOL(min_low_pfn);
 EXPORT_SYMBOL(max_low_pfn);
 
-#ifdef CONFIG_HIGHMEM
-pte_t *kmap_pte;
-EXPORT_SYMBOL(kmap_pte);
-
 static void __init highmem_init(void)
 {
pr_debug("%x\n", (u32)PKMAP_BASE);
map_page(PKMAP_BASE, 0, 0);

[patch V3 16/37] sparc/mm/highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
---
V3: Remove the kmap types cruft
---
 arch/sparc/Kconfig  |1 
 arch/sparc/include/asm/highmem.h|8 +-
 arch/sparc/include/asm/kmap_types.h |   11 ---
 arch/sparc/include/asm/vaddrs.h |4 -
 arch/sparc/mm/Makefile  |3 
 arch/sparc/mm/highmem.c |  115 
 arch/sparc/mm/srmmu.c   |2 
 7 files changed, 8 insertions(+), 136 deletions(-)

--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -139,6 +139,7 @@ config MMU
 config HIGHMEM
bool
default y if SPARC32
+   select KMAP_LOCAL
 
 config ZONE_DMA
bool
--- a/arch/sparc/include/asm/highmem.h
+++ b/arch/sparc/include/asm/highmem.h
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 /* declarations for highmem.c */
@@ -33,8 +32,6 @@ extern unsigned long highstart_pfn, high
 #define kmap_prot __pgprot(SRMMU_ET_PTE | SRMMU_PRIV | SRMMU_CACHE)
 extern pte_t *pkmap_page_table;
 
-void kmap_init(void) __init;
-
 /*
  * Right now we initialize only a single pte table. It can be extended
  * easily, subsequent pte tables have to be allocated in one physical
@@ -53,6 +50,11 @@ void kmap_init(void) __init;
 
 #define flush_cache_kmaps()flush_cache_all()
 
+/* FIXME: Use __flush_tlb_one(vaddr) instead of flush_cache_all() -- Anton */
+#define arch_kmap_local_post_map(vaddr, pteval)flush_cache_all()
+#define arch_kmap_local_post_unmap(vaddr)  flush_cache_all()
+
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_HIGHMEM_H */
--- a/arch/sparc/include/asm/kmap_types.h
+++ /dev/null
@@ -1,11 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_KMAP_TYPES_H
-#define _ASM_KMAP_TYPES_H
-
-/* Dummy header just to define km_type.  None of this
- * is actually used on sparc.  -DaveM
- */
-
-#include 
-
-#endif
--- a/arch/sparc/include/asm/vaddrs.h
+++ b/arch/sparc/include/asm/vaddrs.h
@@ -32,13 +32,13 @@
 #define SRMMU_NOCACHE_ALCRATIO 64  /* 256 pages per 64MB of system RAM */
 
 #ifndef __ASSEMBLY__
-#include 
+#include 
 
 enum fixed_addresses {
FIX_HOLE,
 #ifdef CONFIG_HIGHMEM
FIX_KMAP_BEGIN,
-   FIX_KMAP_END = (KM_TYPE_NR * NR_CPUS),
+   FIX_KMAP_END = (KM_MAX_IDX * NR_CPUS),
 #endif
__end_of_fixed_addresses
 };
--- a/arch/sparc/mm/Makefile
+++ b/arch/sparc/mm/Makefile
@@ -15,6 +15,3 @@ obj-$(CONFIG_SPARC32)   += leon_mm.o
 
 # Only used by sparc64
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-
-# Only used by sparc32
-obj-$(CONFIG_HIGHMEM)   += highmem.o
--- a/arch/sparc/mm/highmem.c
+++ /dev/null
@@ -1,115 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- *  highmem.c: virtual kernel memory mappings for high memory
- *
- *  Provides kernel-static versions of atomic kmap functions originally
- *  found as inlines in include/asm-sparc/highmem.h.  These became
- *  needed as kmap_atomic() and kunmap_atomic() started getting
- *  called from within modules.
- *  -- Tomas Szepe , September 2002
- *
- *  But kmap_atomic() and kunmap_atomic() cannot be inlined in
- *  modules because they are loaded with btfixup-ped functions.
- */
-
-/*
- * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap
- * gives a more generic (and caching) interface. But kmap_atomic can
- * be used in IRQ contexts, so in some (very limited) cases we need it.
- *
- * XXX This is an old text. Actually, it's good to use atomic kmaps,
- * provided you remember that they are atomic and not try to sleep
- * with a kmap taken, much like a spinlock. Non-atomic kmaps are
- * shared by CPUs, and so precious, and establishing them requires IPI.
- * Atomic kmaps are lightweight and we may have NCPUS more of them.
- */
-#include 
-#include 
-#include 
-
-#include 
-#include 
-#include 
-
-static pte_t *kmap_pte;
-
-void __init kmap_init(void)
-{
-   unsigned long address = __fix_to_virt(FIX_KMAP_BEGIN);
-
-/* cache the first kmap pte */
-kmap_pte = virt_to_kpte(address);
-}
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   long idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-
-/* XXX Fix - Anton */
-#if 0
-   __flush_cache_one(vaddr);
-#else
-   flush_cache_all();
-#endif
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte-idx)));
-#endif
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-/* XXX Fix - Anton */
-#if 0
-   __flush_tlb_one(vaddr);
-#else
-   flush_tlb_all();
-#endif
-
-   return (void*) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int type;
-
-   if (vaddr < FIXADDR_START)

[patch V3 23/37] sched: Make migrate_disable/enable() independent of RT

2020-11-03 Thread Thomas Gleixner
Now that the scheduler can deal with migrate disable properly, there is no
compelling reason to make it available only for RT.

There are quite a few code paths which needlessly disable preemption in
order to prevent migration, and some constructs like kmap_atomic() enforce
it implicitly.

Making it available independent of RT allows providing a preemptible
variant of kmap_atomic() and makes the code more consistent in general.

FIXME: Rework the comment in preempt.h
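
A minimal sketch of the intended usage (the worker function is
hypothetical):

	migrate_disable();	/* pin the task to its current CPU */
	/*
	 * The task may still be preempted here, but it resumes on the same
	 * CPU, so CPU-local state such as a kmap slot stays valid. This
	 * does not serialize against other tasks on the same CPU; writable
	 * per-CPU data still needs a lock.
	 */
	do_work_on_this_cpu();	/* hypothetical */
	migrate_enable();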

Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Daniel Bristot de Oliveira 
---
 include/linux/kernel.h  |   21 ++---
 include/linux/preempt.h |   38 +++---
 include/linux/sched.h   |2 +-
 kernel/sched/core.c |   45 +++--
 kernel/sched/sched.h|4 ++--
 lib/smp_processor_id.c  |2 +-
 6 files changed, 56 insertions(+), 56 deletions(-)

--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -204,6 +204,7 @@ extern int _cond_resched(void);
 extern void ___might_sleep(const char *file, int line, int preempt_offset);
 extern void __might_sleep(const char *file, int line, int preempt_offset);
 extern void __cant_sleep(const char *file, int line, int preempt_offset);
+extern void __cant_migrate(const char *file, int line);
 
 /**
  * might_sleep - annotation for functions that can sleep
@@ -227,6 +228,18 @@ extern void __cant_sleep(const char *fil
 # define cant_sleep() \
do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()(current->task_state_change = 0)
+
+/**
+ * cant_migrate - annotation for functions that cannot migrate
+ *
+ * Will print a stack trace if executed in code which is migratable
+ */
+# define cant_migrate()						\
+   do {\
+   if (IS_ENABLED(CONFIG_SMP)) \
+   __cant_migrate(__FILE__, __LINE__); \
+   } while (0)
+
 /**
  * non_block_start - annotate the start of section where sleeping is prohibited
  *
@@ -251,6 +264,7 @@ extern void __cant_sleep(const char *fil
   int preempt_offset) { }
 # define might_sleep() do { might_resched(); } while (0)
 # define cant_sleep() do { } while (0)
+# define cant_migrate()do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
 # define non_block_start() do { } while (0)
 # define non_block_end() do { } while (0)
@@ -258,13 +272,6 @@ extern void __cant_sleep(const char *fil
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
 
-#ifndef CONFIG_PREEMPT_RT
-# define cant_migrate()cant_sleep()
-#else
-  /* Placeholder for now */
-# define cant_migrate()do { } while (0)
-#endif
-
 /**
  * abs - return absolute value of an argument
 * @x: the value.  If it is unsigned type, it is converted to signed type first.
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -322,7 +322,7 @@ static inline void preempt_notifier_init
 
 #endif
 
-#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)
+#ifdef CONFIG_SMP
 
 /*
  * Migrate-Disable and why it is undesired.
@@ -382,43 +382,11 @@ static inline void preempt_notifier_init
 extern void migrate_disable(void);
 extern void migrate_enable(void);
 
-#elif defined(CONFIG_PREEMPT_RT)
+#else
 
 static inline void migrate_disable(void) { }
 static inline void migrate_enable(void) { }
 
-#else /* !CONFIG_PREEMPT_RT */
-
-/**
- * migrate_disable - Prevent migration of the current task
- *
- * Maps to preempt_disable() which also disables preemption. Use
- * migrate_disable() to annotate that the intent is to prevent migration,
- * but not necessarily preemption.
- *
- * Can be invoked nested like preempt_disable() and needs the corresponding
- * number of migrate_enable() invocations.
- */
-static __always_inline void migrate_disable(void)
-{
-   preempt_disable();
-}
-
-/**
- * migrate_enable - Allow migration of the current task
- *
- * Counterpart to migrate_disable().
- *
- * As migrate_disable() can be invoked nested, only the outermost invocation
- * reenables migration.
- *
- * Currently mapped to preempt_enable().
- */
-static __always_inline void migrate_enable(void)
-{
-   preempt_enable();
-}
-
-#endif /* CONFIG_SMP && CONFIG_PREEMPT_RT */
+#endif /* CONFIG_SMP */
 
 #endif /* __LINUX_PREEMPT_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -715,7 +715,7 @@ struct task_struct {
const cpumask_t *cpus_ptr;
cpumask_t   cpus_mask;
void*migration_pending;
-#if defined(CONFIG_SMP) && defined(CONFIG_PRE

[patch V3 24/37] sched: highmem: Store local kmaps in task struct

2020-11-03 Thread Thomas Gleixner
Instead of storing the map per CPU, provide and use per task storage. That
prepares for local kmaps which are preemptible.

The context switch code is preparatory and not yet in use because
kmap_atomic() runs with preemption disabled. Will be made usable in the
next step.

The context switch logic is safe even when an interrupt happens after
clearing or before restoring the kmaps. The kmap index in task struct is
not modified so any nesting kmap in an interrupt will use unused indices
and on return the counter is the same as before.
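
A simplified sketch of the switch-out side (per-CPU slot offsets, guard
pages and the debug checks are omitted; the real code is in mm/highmem.c
below):

  static void kmap_local_sched_out_sketch(struct task_struct *tsk)
  {
          int i;

          for (i = 0; i < tsk->kmap_ctrl.idx; i++) {
                  unsigned long addr = __fix_to_virt(FIX_KMAP_BEGIN + i);

                  /*
                   * pteval[i] was stored at map time. Only the fixmap
                   * slot is torn down; sched-in reinstalls from it.
                   */
                  pte_clear(&init_mm, addr, virt_to_kpte(addr));
          }
          /*
           * kmap_ctrl.idx is deliberately left untouched: an interrupt
           * nesting in here pushes onto the unused indices above.
           */
  }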

Also add an assert into the return to user space code. Going back to user
space with an active kmap local is a nono.

Signed-off-by: Thomas Gleixner 
---
V3: Handle the debug case correctly
---
 include/linux/highmem-internal.h |   10 +++
 include/linux/sched.h|9 +++
 kernel/entry/common.c|2 
 kernel/fork.c|1 
 kernel/sched/core.c  |   18 +++
 mm/highmem.c |   99 +++
 6 files changed, 129 insertions(+), 10 deletions(-)

--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -9,6 +9,16 @@
 void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
 void *__kmap_local_page_prot(struct page *page, pgprot_t prot);
 void kunmap_local_indexed(void *vaddr);
+void kmap_local_fork(struct task_struct *tsk);
+void __kmap_local_sched_out(void);
+void __kmap_local_sched_in(void);
+static inline void kmap_assert_nomap(void)
+{
+   DEBUG_LOCKS_WARN_ON(current->kmap_ctrl.idx);
+}
+#else
+static inline void kmap_local_fork(struct task_struct *tsk) { }
+static inline void kmap_assert_nomap(void) { }
 #endif
 
 #ifdef CONFIG_HIGHMEM
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -629,6 +630,13 @@ struct wake_q_node {
struct wake_q_node *next;
 };
 
+struct kmap_ctrl {
+#ifdef CONFIG_KMAP_LOCAL
+   int idx;
+   pte_t   pteval[KM_TYPE_NR];
+#endif
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1294,6 +1302,7 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+   struct kmap_ctrlkmap_ctrl;
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long   task_state_change;
 #endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -2,6 +2,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -194,6 +195,7 @@ static void exit_to_user_mode_prepare(st
 
/* Ensure that the address limit is intact and no locks are held */
addr_limit_user_check();
+   kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
 }
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -930,6 +930,7 @@ static struct task_struct *dup_task_stru
account_kernel_stack(tsk, 1);
 
kcov_task_init(tsk);
+   kmap_local_fork(tsk);
 
 #ifdef CONFIG_FAULT_INJECTION
tsk->fail_nth = 0;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4053,6 +4053,22 @@ static inline void finish_lock_switch(st
 # define finish_arch_post_lock_switch()do { } while (0)
 #endif
 
+static inline void kmap_local_sched_out(void)
+{
+#ifdef CONFIG_KMAP_LOCAL
+   if (unlikely(current->kmap_ctrl.idx))
+   __kmap_local_sched_out();
+#endif
+}
+
+static inline void kmap_local_sched_in(void)
+{
+#ifdef CONFIG_KMAP_LOCAL
+   if (unlikely(current->kmap_ctrl.idx))
+   __kmap_local_sched_in();
+#endif
+}
+
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -4075,6 +4091,7 @@ prepare_task_switch(struct rq *rq, struc
perf_event_task_sched_out(prev, next);
rseq_preempt(prev);
fire_sched_out_preempt_notifiers(prev, next);
+   kmap_local_sched_out();
prepare_task(next);
prepare_arch_switch(next);
 }
@@ -4141,6 +4158,7 @@ static struct rq *finish_task_switch(str
finish_lock_switch(rq);
finish_arch_post_lock_switch();
kcov_finish_switch(current);
+   kmap_local_sched_in();
 
fire_sched_in_preempt_notifiers(current);
/*
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -365,8 +365,6 @@ EXPORT_SYMBOL(kunmap_high);
 
 #include 
 
-static DEFINE_PER_CPU(int, __kmap_local_idx);
-
 /*
  * With DEBUG_HIGHMEM the stack depth is doubled and every second
  * slot is unused which acts as a guard page
@@ -379,23 +377,21 @@ static DEFINE_PER_CPU(int, __kmap_local_
 
 static inline int kmap_local_idx_push(void)
 {
-   int idx = __this_cpu_add_return(__kmap_local_idx, KM_INCR) - 1;
-
WARN_ON_ONCE(in_irq() && !irqs_disabled());
-   BUG_ON(idx >= KM_MAX_I

[patch V3 00/37] mm/highmem: Preemptible variant of kmap_atomic & friends

2020-11-03 Thread Thomas Gleixner
Following up to the discussion in:

  https://lore.kernel.org/r/20200914204209.256266...@linutronix.de

and the second version of this:

  https://lore.kernel.org/r/20201029221806.189523...@linutronix.de

this series provides a preemptible variant of kmap_atomic & related
interfaces.

This is achieved by:

 - Removing the RT dependency from migrate_disable/enable()

 - Consolidating all kmap atomic implementations in generic code including
   a useful version of the CONFIG_DEBUG_HIGHMEM which provides guard pages
   between the individual maps instead of just increasing the map size.

 - Switching from per CPU storage of the kmap index to a per task storage

 - Adding a pteval array to the per task storage which contains the ptevals
   of the currently active temporary kmaps

 - Adding context switch code which checks whether the outgoing or the
   incoming task has active temporary kmaps. If so, the outgoing task's
   kmaps are removed and the incoming task's kmaps are restored.

 - Adding new interfaces k[un]map_local*() which are not disabling
   preemption and can be called from any context (except NMI).

   Contrary to kmap() which provides preemptible and "persistent" mappings,
   these interfaces are meant to replace the temporary mappings provided by
   kmap_atomic*() today.

This allows getting rid of conditional mapping choices and having
preemptible short term mappings on 64bit, which are today enforced to be
non-preemptible due to the highmem constraints. It clearly puts overhead on
the highmem users, but highmem is slow anyway.

This is not a wholesale conversion which makes kmap_atomic magically
preemptible because there might be usage sites which rely on the implicit
preempt disable. So this needs to be done on a case by case basis and the
call sites converted to kmap_local().
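
A typical conversion then looks like this (sketch; names per the new
k[un]map_local*() family introduced in this series):

  /* Before: disables preemption and pagefaults implicitly */
  addr = kmap_atomic(page);
  memcpy(addr, buf, len);
  kunmap_atomic(addr);

  /* After: plain preemptible mapping, usable in the same places */
  addr = kmap_local_page(page);
  memcpy(addr, buf, len);
  kunmap_local(addr);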

Note that this is only tested on X86 and completely untested on all other
architectures (at least it compiles except on csky which does not compile
with the newest cross tools from kernel.org independent of this change).

The lot is available from

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem

It is based on Peter Zijlstras migrate disable branch which is close to be
merged into the tip tree, but still not finalized:

   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git 
sched/migrate-disable

The series has the following parts:

Patches  1 - 22: Consolidation work which is independent of the scheduler
 changes

 79 files changed, 595 insertions(+), 1296 deletions(-)

Patch   23:  Needs to be folded back into the sched/migrate-disable

Patches 24 - 26: The preemptible kmap_local() implementation

 9 files changed, 283 insertions(+), 57 deletions(-)

Patches 27 - 37: Cleanup of the less common kmap/io_map_atomic users

 19 files changed, 114 insertions(+), 256 deletions(-)

Vs. merging this pile:

If everyone agrees, I'd like to take the first part (1-22) through tip so
that the preemptible implementation can be sorted in tip once the scheduler
prerequisites are there. The initial cleanups (27-37) might have to wait if
there are conflicts vs. the drm/gpu tree. We'll see.

From what I can tell kmap_atomic() can be removed altogether and
completely replaced by kmap_local(). Most of the usage sites are trivial and
just doing memcpy(), memset() or trivial operations on the temporarily
mapped page. The interesting ones are those which do either conditional
stuff or have copy_.*_user_inatomic() inside. As shown with the crash and
drm/gpu cleanups this allows to simplify the code quite a bit.

Changes vs. V2:

  - Remove the migrate disable from kmap_local and only issue that when the
there is an actual highmem mapping. (Linus)
  - Reordered the series so the consolidation is upfront
  - Get rid of kmap_types.h and the associated cruft
  - Fixup documentation and add function documentation for kmap_*
  - Split out the internal implementation into a separate header
  - More cleanups - removal of unused functions
  - Replace a few of the less frequently used kmap_atomic and
io_mapping_map_atomic variants and remove those interfaces.

Thanks,

tglx
---
 arch/alpha/include/asm/kmap_types.h   |   15 
 arch/arc/include/asm/kmap_types.h |   14 
 arch/arm/include/asm/kmap_types.h |   10 
 arch/arm/mm/highmem.c |  121 ---
 arch/ia64/include/asm/kmap_types.h|   13 
 arch/microblaze/mm/highmem.c  |   78 
 arch/mips/include/asm/kmap_types.h|   13 
 arch/nds32/mm/highmem.c   |   48 --
 arch/parisc/include/asm/kmap_types.h  |   13 
 arch/powerpc/include/asm/kmap_types.h |   13 
 arch/powerpc/mm/highmem.c |   67 
 arch/sh/include/as

[patch V3 15/37] powerpc/mm/highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
No reason having the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-...@lists.ozlabs.org
---
V3: Remove the kmap types cruft
---
 arch/powerpc/Kconfig  |1 
 arch/powerpc/include/asm/fixmap.h |4 +-
 arch/powerpc/include/asm/highmem.h|7 ++-
 arch/powerpc/include/asm/kmap_types.h |   13 --
 arch/powerpc/mm/Makefile  |1 
 arch/powerpc/mm/highmem.c |   67 --
 arch/powerpc/mm/mem.c |7 ---
 7 files changed, 8 insertions(+), 92 deletions(-)

--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -409,6 +409,7 @@ menu "Kernel options"
 config HIGHMEM
bool "High memory support"
depends on PPC32
+   select KMAP_LOCAL
 
 source "kernel/Kconfig.hz"
 
--- a/arch/powerpc/include/asm/fixmap.h
+++ b/arch/powerpc/include/asm/fixmap.h
@@ -20,7 +20,7 @@
 #include 
 #ifdef CONFIG_HIGHMEM
 #include 
-#include 
+#include 
 #endif
 
 #ifdef CONFIG_KASAN
@@ -55,7 +55,7 @@ enum fixed_addresses {
FIX_EARLY_DEBUG_BASE = FIX_EARLY_DEBUG_TOP+(ALIGN(SZ_128K, 
PAGE_SIZE)/PAGE_SIZE)-1,
 #ifdef CONFIG_HIGHMEM
FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
-   FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
+   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_MAX_IDX * NR_CPUS) - 1,
 #endif
 #ifdef CONFIG_PPC_8xx
/* For IMMR we need an aligned 512K area */
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -24,12 +24,10 @@
 #ifdef __KERNEL__
 
 #include 
-#include 
 #include 
 #include 
 #include 
 
-extern pte_t *kmap_pte;
 extern pte_t *pkmap_page_table;
 
 /*
@@ -60,6 +58,11 @@ extern pte_t *pkmap_page_table;
 
 #define flush_cache_kmaps()flush_cache_all()
 
+#define arch_kmap_local_post_map(vaddr, pteval)\
+   local_flush_tlb_page(NULL, vaddr)
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_page(NULL, vaddr)
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_HIGHMEM_H */
--- a/arch/powerpc/include/asm/kmap_types.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_KMAP_TYPES_H
-#define _ASM_POWERPC_KMAP_TYPES_H
-
-#ifdef __KERNEL__
-
-/*
- */
-
-#define KM_TYPE_NR 16
-
-#endif /* __KERNEL__ */
-#endif /* _ASM_POWERPC_KMAP_TYPES_H */
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -16,7 +16,6 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += num
 obj-$(CONFIG_PPC_MM_SLICES)+= slice.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
-obj-$(CONFIG_HIGHMEM)  += highmem.o
 obj-$(CONFIG_PPC_COPRO_BASE)   += copro_fault.o
 obj-$(CONFIG_PPC_PTDUMP)   += ptdump/
 obj-$(CONFIG_KASAN)+= kasan/
--- a/arch/powerpc/mm/highmem.c
+++ /dev/null
@@ -1,67 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * highmem.c: virtual kernel memory mappings for high memory
- *
- * PowerPC version, stolen from the i386 version.
- *
- * Used in CONFIG_HIGHMEM systems for memory pages which
- * are not addressable by direct kernel virtual addresses.
- *
- * Copyright (C) 1999 Gerhard Wichert, Siemens AG
- *   gerhard.wich...@pdb.siemens.de
- *
- *
- * Redesigned the x86 32-bit VM architecture to deal with
- * up to 16 Terrabyte physical memory. With current x86 CPUs
- * we now support up to 64 Gigabytes physical RAM.
- *
- * Copyright (C) 1999 Ingo Molnar 
- *
- * Reworked for PowerPC by various contributors. Moved from
- * highmem.h by Benjamin Herrenschmidt (c) 2009 IBM Corp.
- */
-
-#include 
-#include 
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   WARN_ON(IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !pte_none(*(kmap_pte - 
idx)));
-   __set_pte_at(&init_mm, vaddr, kmap_pte-idx, mk_pte(page, prot), 1);
-   local_flush_tlb_page(NULL, vaddr);
-
-   return (void*) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-
-   if (vaddr < __fix_to_virt(FIX_KMAP_END))
-   return;
-
-   if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM)) {
-   int type = kmap_atomic_idx();
-   unsigned int idx;
-
-   idx = type + KM_TYPE_NR * smp_processor_id();
-   WARN_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-
-   /*
-* force other mappings to Oops if they'll try to access
-* this pte without first remap it
-*/
-   pte_clear(&init_mm, vaddr, 

[patch V3 01/37] mm/highmem: Un-EXPORT __kmap_atomic_idx()

2020-11-03 Thread Thomas Gleixner
Nothing in modules can use that.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Cc: Andrew Morton 
Cc: linux...@kvack.org
---
 mm/highmem.c |2 --
 1 file changed, 2 deletions(-)

--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -108,8 +108,6 @@ static inline wait_queue_head_t *get_pkm
 atomic_long_t _totalhigh_pages __read_mostly;
 EXPORT_SYMBOL(_totalhigh_pages);
 
-EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx);
-
 unsigned int nr_free_highpages (void)
 {
struct zone *zone;



[patch V3 03/37] fs: Remove asm/kmap_types.h includes

2020-11-03 Thread Thomas Gleixner
Historical leftovers from the time where kmap() had fixed slots.

Signed-off-by: Thomas Gleixner 
Cc: Alexander Viro 
Cc: Benjamin LaHaise 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@kvack.org
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
---
 fs/aio.c |1 -
 fs/btrfs/ctree.h |1 -
 2 files changed, 2 deletions(-)

--- a/fs/aio.c
+++ b/fs/aio.c
@@ -43,7 +43,6 @@
 #include 
 #include 
 
-#include 
 #include 
 #include 
 
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -17,7 +17,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 



[patch V3 05/37] asm-generic: Provide kmap_size.h

2020-11-03 Thread Thomas Gleixner
kmap_types.h is a misnomer because the old atomic MAP based array does not
exist anymore and the whole indirection of architectures including
kmap_types.h is inconsistent and does not allow providing guard page
debugging for this misfeature.

Add a common header file which defines the mapping stack size for all
architectures. Will be used when converting architectures over to a
generic kmap_local/atomic implementation.

The array size is chosen with the following constraints in mind:

- The deepest nest level in one context is 3 according to code
  inspection.

- The worst case nesting for the upcoming preemptible version would be:

  2 maps in task context and a fault inside
  2 maps in the fault handler
  3 maps in softirq
  2 maps in interrupt

So a total of 16 is sufficient and probably overestimated.
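
Spelled out as a hypothetical compile-time check (not part of the patch):

  /* task + fault handler + softirq + hard interrupt, per the list above */
  #define KM_WORST_CASE_NESTING   (2 + 2 + 3 + 2)         /* = 9 */
  _Static_assert(KM_WORST_CASE_NESTING <= 16,
                 "worst case nesting must fit into KM_MAX_IDX");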

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 include/asm-generic/Kbuild  |1 +
 include/asm-generic/kmap_size.h |   12 
 2 files changed, 13 insertions(+)

--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -31,6 +31,7 @@ mandatory-y += irq_regs.h
 mandatory-y += irq_work.h
 mandatory-y += kdebug.h
 mandatory-y += kmap_types.h
+mandatory-y += kmap_size.h
 mandatory-y += kprobes.h
 mandatory-y += linkage.h
 mandatory-y += local.h
--- /dev/null
+++ b/include/asm-generic/kmap_size.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_KMAP_SIZE_H
+#define _ASM_GENERIC_KMAP_SIZE_H
+
+/* For debug this provides guard pages between the maps */
+#ifdef CONFIG_DEBUG_HIGHMEM
+# define KM_MAX_IDX33
+#else
+# define KM_MAX_IDX16
+#endif
+
+#endif



[patch V3 04/37] sh/highmem: Remove all traces of unused cruft

2020-11-03 Thread Thomas Gleixner
For whatever reasons SH has highmem bits all over the place but does
not enable it via Kconfig. Remove the bitrot.

Signed-off-by: Thomas Gleixner 
---
 arch/sh/include/asm/fixmap.h |8 
 arch/sh/include/asm/kmap_types.h |   15 ---
 arch/sh/mm/init.c|8 
 3 files changed, 31 deletions(-)

--- a/arch/sh/include/asm/fixmap.h
+++ b/arch/sh/include/asm/fixmap.h
@@ -13,9 +13,6 @@
 #include 
 #include 
 #include 
-#ifdef CONFIG_HIGHMEM
-#include 
-#endif
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -53,11 +50,6 @@ enum fixed_addresses {
FIX_CMAP_BEGIN,
FIX_CMAP_END = FIX_CMAP_BEGIN + (FIX_N_COLOURS * NR_CPUS) - 1,
 
-#ifdef CONFIG_HIGHMEM
-   FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
-   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_TYPE_NR * NR_CPUS) - 1,
-#endif
-
 #ifdef CONFIG_IOREMAP_FIXED
/*
 * FIX_IOREMAP entries are useful for mapping physical address
--- a/arch/sh/include/asm/kmap_types.h
+++ /dev/null
@@ -1,15 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __SH_KMAP_TYPES_H
-#define __SH_KMAP_TYPES_H
-
-/* Dummy header just to define km_type. */
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-#define  __WITH_KM_FENCE
-#endif
-
-#include 
-
-#undef __WITH_KM_FENCE
-
-#endif
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -362,9 +362,6 @@ void __init mem_init(void)
mem_init_print_info(NULL);
pr_info("virtual kernel memory layout:\n"
"fixmap  : 0x%08lx - 0x%08lx   (%4ld kB)\n"
-#ifdef CONFIG_HIGHMEM
-   "pkmap   : 0x%08lx - 0x%08lx   (%4ld kB)\n"
-#endif
"vmalloc : 0x%08lx - 0x%08lx   (%4ld MB)\n"
"lowmem  : 0x%08lx - 0x%08lx   (%4ld MB) (cached)\n"
 #ifdef CONFIG_UNCACHED_MAPPING
@@ -376,11 +373,6 @@ void __init mem_init(void)
FIXADDR_START, FIXADDR_TOP,
(FIXADDR_TOP - FIXADDR_START) >> 10,
 
-#ifdef CONFIG_HIGHMEM
-   PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE,
-   (LAST_PKMAP*PAGE_SIZE) >> 10,
-#endif
-
(unsigned long)VMALLOC_START, VMALLOC_END,
(VMALLOC_END - VMALLOC_START) >> 20,
 



[patch V3 08/37] x86/mm/highmem: Use generic kmap atomic implementation

2020-11-03 Thread Thomas Gleixner
Convert X86 to the generic kmap atomic implementation and make the
iomap_atomic() naming convention consistent while at it.

Signed-off-by: Thomas Gleixner 
Cc: x...@kernel.org
---
V3: Remove the kmap_types cruft
---
 arch/x86/Kconfig  |3 +
 arch/x86/include/asm/fixmap.h |5 +-
 arch/x86/include/asm/highmem.h|   13 +--
 arch/x86/include/asm/iomap.h  |   18 +-
 arch/x86/include/asm/kmap_types.h |   13 ---
 arch/x86/include/asm/paravirt_types.h |1 
 arch/x86/mm/highmem_32.c  |   59 --
 arch/x86/mm/init_32.c |   15 
 arch/x86/mm/iomap_32.c|   59 ++
 include/linux/highmem.h   |2 -
 include/linux/io-mapping.h|2 -
 mm/highmem.c  |2 -
 12 files changed, 31 insertions(+), 161 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -14,10 +14,11 @@ config X86_32
select ARCH_WANT_IPC_PARSE_VERSION
select CLKSRC_I8253
select CLONE_BACKWARDS
+   select GENERIC_VDSO_32
select HAVE_DEBUG_STACKOVERFLOW
+   select KMAP_LOCAL
select MODULES_USE_ELF_REL
select OLD_SIGACTION
-   select GENERIC_VDSO_32
 
 config X86_64
def_bool y
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -31,7 +31,7 @@
 #include 
 #ifdef CONFIG_X86_32
 #include 
-#include 
+#include 
 #else
 #include 
 #endif
@@ -94,7 +94,7 @@ enum fixed_addresses {
 #endif
 #ifdef CONFIG_X86_32
FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
-   FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
+   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_MAX_IDX * NR_CPUS) - 1,
 #ifdef CONFIG_PCI_MMCONFIG
FIX_PCIE_MCFG,
 #endif
@@ -151,7 +151,6 @@ extern void reserve_top_address(unsigned
 
 extern int fixmaps_set;
 
-extern pte_t *kmap_pte;
 extern pte_t *pkmap_page_table;
 
 void __native_set_fixmap(enum fixed_addresses idx, pte_t pte);
--- a/arch/x86/include/asm/highmem.h
+++ b/arch/x86/include/asm/highmem.h
@@ -23,7 +23,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -58,11 +57,17 @@ extern unsigned long highstart_pfn, high
 #define PKMAP_NR(virt)  ((virt-PKMAP_BASE) >> PAGE_SHIFT)
 #define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))
 
-void *kmap_atomic_pfn(unsigned long pfn);
-void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot);
-
 #define flush_cache_kmaps()do { } while (0)
 
+#definearch_kmap_local_post_map(vaddr, pteval) \
+   arch_flush_lazy_mmu_mode()
+
+#definearch_kmap_local_post_unmap(vaddr)   \
+   do {\
+   flush_tlb_one_kernel((vaddr));  \
+   arch_flush_lazy_mmu_mode(); \
+   } while (0)
+
 extern void add_highpages_with_active_regions(int nid, unsigned long start_pfn,
unsigned long end_pfn);
 
--- a/arch/x86/include/asm/iomap.h
+++ b/arch/x86/include/asm/iomap.h
@@ -9,19 +9,21 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
-void __iomem *
-iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot);
+void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot);
 
-void
-iounmap_atomic(void __iomem *kvaddr);
+static inline void iounmap_atomic(void __iomem *vaddr)
+{
+   kunmap_local_indexed((void __force *)vaddr);
+   pagefault_enable();
+   preempt_enable();
+}
 
-int
-iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
+int iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
 
-void
-iomap_free(resource_size_t base, unsigned long size);
+void iomap_free(resource_size_t base, unsigned long size);
 
 #endif /* _ASM_X86_IOMAP_H */
--- a/arch/x86/include/asm/kmap_types.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_KMAP_TYPES_H
-#define _ASM_X86_KMAP_TYPES_H
-
-#if defined(CONFIG_X86_32) && defined(CONFIG_DEBUG_HIGHMEM)
-#define  __WITH_KM_FENCE
-#endif
-
-#include 
-
-#undef __WITH_KM_FENCE
-
-#endif /* _ASM_X86_KMAP_TYPES_H */
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -41,7 +41,6 @@
 #ifndef __ASSEMBLY__
 
 #include 
-#include 
 #include 
 #include 
 
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -4,65 +4,6 @@
 #include  /* for totalram_pages */
 #include 
 
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   BUG_ON(!pte_none(*(kmap_pte-idx)));
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-   arch_flush_lazy_mmu

[patch V3 07/37] highmem: Make DEBUG_HIGHMEM functional

2020-11-03 Thread Thomas Gleixner
For some obscure reason when CONFIG_DEBUG_HIGHMEM is enabled the stack
depth is increased from 20 to 41. But the only thing DEBUG_HIGHMEM does is
to enable a few BUG_ON()'s in the mapping code.

That's a leftover from the historical mapping code which had fixed entries
for various purposes. DEBUG_HIGHMEM inserted guard mappings between the map
types. But that got all ditched when kmap_atomic() switched to a stack
based map management. Though the WITH_KM_FENCE magic survived without being
functional. All it does today is increase the stack depth.

Add a working implementation to the generic kmap_local* implementation.
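
The effect of the doubled stride can be seen with a trivial user space
sketch of the index arithmetic (assuming KM_INCR == 2 as in the hunk
below):

  #include <stdio.h>

  int main(void)
  {
          int idx = 0;

          for (int i = 0; i < 4; i++) {
                  idx += 2;       /* __this_cpu_add_return(..., KM_INCR) */
                  printf("map %d -> slot %d, slot %d left as guard\n",
                         i, idx - 1, idx);
          }
          return 0;
  }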

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 mm/highmem.c |   14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -374,9 +374,19 @@ EXPORT_SYMBOL(kunmap_high);
 
 static DEFINE_PER_CPU(int, __kmap_local_idx);
 
+/*
+ * With DEBUG_HIGHMEM the stack depth is doubled and every second
+ * slot is unused which acts as a guard page
+ */
+#ifdef CONFIG_DEBUG_HIGHMEM
+# define KM_INCR   2
+#else
+# define KM_INCR   1
+#endif
+
 static inline int kmap_local_idx_push(void)
 {
-   int idx = __this_cpu_inc_return(__kmap_local_idx) - 1;
+   int idx = __this_cpu_add_return(__kmap_local_idx, KM_INCR) - 1;
 
WARN_ON_ONCE(in_irq() && !irqs_disabled());
BUG_ON(idx >= KM_MAX_IDX);
@@ -390,7 +400,7 @@ static inline int kmap_local_idx(void)
 
 static inline void kmap_local_idx_pop(void)
 {
-   int idx = __this_cpu_dec_return(__kmap_local_idx);
+   int idx = __this_cpu_sub_return(__kmap_local_idx, KM_INCR);
 
BUG_ON(idx < 0);
 }



[patch V3 02/37] highmem: Remove unused functions

2020-11-03 Thread Thomas Gleixner
Nothing uses totalhigh_pages_dec() and totalhigh_pages_set().

Signed-off-by: Thomas Gleixner 
---
V3: New patch
---
 include/linux/highmem.h |   10 --
 1 file changed, 10 deletions(-)

--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -104,21 +104,11 @@ static inline void totalhigh_pages_inc(v
atomic_long_inc(&_totalhigh_pages);
 }
 
-static inline void totalhigh_pages_dec(void)
-{
-   atomic_long_dec(&_totalhigh_pages);
-}
-
 static inline void totalhigh_pages_add(long count)
 {
atomic_long_add(count, &_totalhigh_pages);
 }
 
-static inline void totalhigh_pages_set(long val)
-{
-   atomic_long_set(&_totalhigh_pages, val);
-}
-
 void kmap_flush_unused(void);
 
 struct page *kmap_to_page(void *addr);



[patch V3 10/37] ARM: highmem: Switch to generic kmap atomic

2020-11-03 Thread Thomas Gleixner
No reason having the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Russell King 
Cc: Arnd Bergmann 
Cc: linux-arm-ker...@lists.infradead.org
---
V3: Remove the kmap types cruft
---
 arch/arm/Kconfig  |1 
 arch/arm/include/asm/fixmap.h |4 -
 arch/arm/include/asm/highmem.h|   33 +++---
 arch/arm/include/asm/kmap_types.h |   10 ---
 arch/arm/mm/Makefile  |1 
 arch/arm/mm/highmem.c |  121 --
 6 files changed, 26 insertions(+), 144 deletions(-)

--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1498,6 +1498,7 @@ config HAVE_ARCH_PFN_VALID
 config HIGHMEM
bool "High Memory Support"
depends on MMU
+   select KMAP_LOCAL
help
  The address space of ARM processors is only 4 Gigabytes large
  and it has to accommodate user address space, kernel address
--- a/arch/arm/include/asm/fixmap.h
+++ b/arch/arm/include/asm/fixmap.h
@@ -7,14 +7,14 @@
 #define FIXADDR_TOP(FIXADDR_END - PAGE_SIZE)
 
 #include 
-#include 
+#include 
 
 enum fixed_addresses {
FIX_EARLYCON_MEM_BASE,
__end_of_permanent_fixed_addresses,
 
FIX_KMAP_BEGIN = __end_of_permanent_fixed_addresses,
-   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_TYPE_NR * NR_CPUS) - 1,
+   FIX_KMAP_END = FIX_KMAP_BEGIN + (KM_MAX_IDX * NR_CPUS) - 1,
 
/* Support writing RO kernel text via kprobes, jump labels, etc. */
FIX_TEXT_POKE0,
--- a/arch/arm/include/asm/highmem.h
+++ b/arch/arm/include/asm/highmem.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_HIGHMEM_H
 #define _ASM_HIGHMEM_H
 
-#include 
+#include 
 
 #define PKMAP_BASE (PAGE_OFFSET - PMD_SIZE)
 #define LAST_PKMAP PTRS_PER_PTE
@@ -46,19 +46,32 @@ extern pte_t *pkmap_page_table;
 
 #ifdef ARCH_NEEDS_KMAP_HIGH_GET
 extern void *kmap_high_get(struct page *page);
-#else
+
+static inline void *arch_kmap_local_high_get(struct page *page)
+{
+   if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !cache_is_vivt())
+   return NULL;
+   return kmap_high_get(page);
+}
+#define arch_kmap_local_high_get arch_kmap_local_high_get
+
+#else /* ARCH_NEEDS_KMAP_HIGH_GET */
 static inline void *kmap_high_get(struct page *page)
 {
return NULL;
 }
-#endif
+#endif /* !ARCH_NEEDS_KMAP_HIGH_GET */
 
-/*
- * The following functions are already defined by 
- * when CONFIG_HIGHMEM is not set.
- */
-#ifdef CONFIG_HIGHMEM
-extern void *kmap_atomic_pfn(unsigned long pfn);
-#endif
+#define arch_kmap_local_post_map(vaddr, pteval)
\
+   local_flush_tlb_kernel_page(vaddr)
+
+#define arch_kmap_local_pre_unmap(vaddr)   \
+do {   \
+   if (cache_is_vivt())\
+   __cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE); \
+} while (0)
+
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_kernel_page(vaddr)
 
 #endif
--- a/arch/arm/include/asm/kmap_types.h
+++ /dev/null
@@ -1,10 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ARM_KMAP_TYPES_H
-#define __ARM_KMAP_TYPES_H
-
-/*
- * This is the "bare minimum".  AIO seems to require this.
- */
-#define KM_TYPE_NR 16
-
-#endif
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -19,7 +19,6 @@ obj-$(CONFIG_MODULES) += proc-syms.o
 obj-$(CONFIG_DEBUG_VIRTUAL)+= physaddr.o
 
 obj-$(CONFIG_ALIGNMENT_TRAP)   += alignment.o
-obj-$(CONFIG_HIGHMEM)  += highmem.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_ARM_PV_FIXUP) += pv-fixup-asm.o
 
--- a/arch/arm/mm/highmem.c
+++ /dev/null
@@ -1,121 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * arch/arm/mm/highmem.c -- ARM highmem support
- *
- * Author: Nicolas Pitre
- * Created:september 8, 2008
- * Copyright:  Marvell Semiconductors Inc.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include "mm.h"
-
-static inline void set_fixmap_pte(int idx, pte_t pte)
-{
-   unsigned long vaddr = __fix_to_virt(idx);
-   pte_t *ptep = virt_to_kpte(vaddr);
-
-   set_pte_ext(ptep, pte, 0);
-   local_flush_tlb_kernel_page(vaddr);
-}
-
-static inline pte_t get_fixmap_pte(unsigned long vaddr)
-{
-   pte_t *ptep = virt_to_kpte(vaddr);
-
-   return *ptep;
-}
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned int idx;
-   unsigned long vaddr;
-   void *kmap;
-   int type;
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   /*
-* There is no cache coherency issue when non VIVT, so force the
-* dedicated kmap usage for better debugging purposes in that case.
-*/
-   if (!cache_is_vivt())
-   kmap = NULL;
-   else
-#endif
-   kmap = kmap_high_get(page);
-   if (kmap)
-

[patch V3 06/37] highmem: Provide generic variant of kmap_atomic*

2020-11-03 Thread Thomas Gleixner
The kmap_atomic* interfaces in all architectures are pretty much the same
except for post map operations (flush) and pre- and post unmap operations.

Provide a generic variant for that.

Signed-off-by: Thomas Gleixner 
Cc: Andrew Morton 
Cc: linux...@kvack.org
---
V3: Do not reuse the kmap_atomic_idx pile and use kmap_size.h right away
V2: Address review comments from Christoph (style and EXPORT variant)
---
 include/linux/highmem.h |   82 ++-
 mm/Kconfig  |3 +
 mm/highmem.c|  144 +++-
 3 files changed, 211 insertions(+), 18 deletions(-)

--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -31,9 +31,16 @@ static inline void invalidate_kernel_vma
 
 #include 
 
+/*
+ * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft.
+ */
+#ifdef CONFIG_KMAP_LOCAL
+void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
+void *__kmap_local_page_prot(struct page *page, pgprot_t prot);
+void kunmap_local_indexed(void *vaddr);
+#endif
+
 #ifdef CONFIG_HIGHMEM
-extern void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
-extern void kunmap_atomic_high(void *kvaddr);
 #include 
 
 #ifndef ARCH_HAS_KMAP_FLUSH_TLB
@@ -81,6 +88,11 @@ static inline void kunmap(struct page *p
  * be used in IRQ contexts, so in some (very limited) cases we need
  * it.
  */
+
+#ifndef CONFIG_KMAP_LOCAL
+void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
+void kunmap_atomic_high(void *kvaddr);
+
 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
preempt_disable();
@@ -89,7 +101,38 @@ static inline void *kmap_atomic_prot(str
return page_address(page);
return kmap_atomic_high_prot(page, prot);
 }
-#define kmap_atomic(page)  kmap_atomic_prot(page, kmap_prot)
+
+static inline void __kunmap_atomic(void *vaddr)
+{
+   kunmap_atomic_high(vaddr);
+}
+#else /* !CONFIG_KMAP_LOCAL */
+
+static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+{
+   preempt_disable();
+   pagefault_disable();
+   return __kmap_local_page_prot(page, prot);
+}
+
+static inline void *kmap_atomic_pfn(unsigned long pfn)
+{
+   preempt_disable();
+   pagefault_disable();
+   return __kmap_local_pfn_prot(pfn, kmap_prot);
+}
+
+static inline void __kunmap_atomic(void *addr)
+{
+   kunmap_local_indexed(addr);
+}
+
+#endif /* CONFIG_KMAP_LOCAL */
+
+static inline void *kmap_atomic(struct page *page)
+{
+   return kmap_atomic_prot(page, kmap_prot);
+}
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
@@ -147,25 +190,33 @@ static inline void *kmap_atomic(struct p
pagefault_disable();
return page_address(page);
 }
-#define kmap_atomic_prot(page, prot)   kmap_atomic(page)
 
-static inline void kunmap_atomic_high(void *addr)
+static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+{
+   return kmap_atomic(page);
+}
+
+static inline void *kmap_atomic_pfn(unsigned long pfn)
+{
+   return kmap_atomic(pfn_to_page(pfn));
+}
+
+static inline void __kunmap_atomic(void *addr)
 {
/*
 * Mostly nothing to do in the CONFIG_HIGHMEM=n case as kunmap_atomic()
-* handles re-enabling faults + preemption
+* handles re-enabling faults and preemption
 */
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
kunmap_flush_on_unmap(addr);
 #endif
 }
 
-#define kmap_atomic_pfn(pfn)   kmap_atomic(pfn_to_page(pfn))
-
 #define kmap_flush_unused()do {} while(0)
 
 #endif /* CONFIG_HIGHMEM */
 
+#if !defined(CONFIG_KMAP_LOCAL)
 #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
 
 DECLARE_PER_CPU(int, __kmap_atomic_idx);
@@ -196,22 +247,21 @@ static inline void kmap_atomic_idx_pop(v
__this_cpu_dec(__kmap_atomic_idx);
 #endif
 }
-
+#endif
 #endif
 
 /*
  * Prevent people trying to call kunmap_atomic() as if it were kunmap()
  * kunmap_atomic() should get the return value of kmap_atomic, not the page.
  */
-#define kunmap_atomic(addr) \
-do {\
-   BUILD_BUG_ON(__same_type((addr), struct page *));   \
-   kunmap_atomic_high(addr);  \
-   pagefault_enable(); \
-   preempt_enable();   \
+#define kunmap_atomic(__addr)  \
+do {   \
+   BUILD_BUG_ON(__same_type((__addr), struct page *)); \
+   __kunmap_atomic(__addr);\
+   pagefault_enable(); \
+   preempt_enable();   \
 } while (0)
 
-
 /* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */
 #ifndef clear_user_highpage
 static inline void clear_user_highpage

Re: [RFC PATCH 00/26] Runtime paravirt patching

2020-04-08 Thread Thomas Gleixner
Ankur Arora  writes:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> switch pv-ops based on that.

If your host changes its advertised behaviour then you want to fix the
host setup or find a competent admin.

> This lockorture splat that I saw on the guest while testing this is
> indicative of the problem:
>
>   [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! 
> [lock_torture_wr:12865]
>   [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 
> 5.4.0-rc7+ #77
>   [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
>
> (Caused by an oversubscribed host but using mismatched native pv_lock_ops
> on the guest.)

And this illustrates what? The fact that you used a misconfigured setup.

> This series addresses the problem by doing paravirt switching at
> runtime.

You're not addressing the problem. You're fixing the symptom, which is
wrong to begin with.

> The alternative use-case is a runtime version of apply_alternatives()
> (not posted with this patch-set) that can be used for some safe subset
> of X86_FEATUREs. This could be useful in conjunction with the ongoing
> late microcode loading work that Mihai Carabas and others have been
> working on.

This has been discussed to death before and there is no safe subset as
long as this hasn't been resolved:

  
https://lore.kernel.org/lkml/alpine.deb.2.21.1909062237580.1...@nanos.tec.linutronix.de/

Thanks,

tglx


Re: [PATCH 0/5] x86/vmware: Steal time accounting support

2020-03-12 Thread Thomas Gleixner
Alexey Makhalov  writes:
>
> Alexey Makhalov (5):
>   x86/vmware: Make vmware_select_hypercall() __init
>   x86/vmware: Remove vmware_sched_clock_setup()
>   x86/vmware: Steal time clock for VMware guest
>   x86/vmware: Enable steal time accounting
>   x86/vmware: Use bool type for vmw_sched_clock

Reviewed-by: Thomas Gleixner 


Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap

2020-02-28 Thread Thomas Gleixner
Jürgen Groß  writes:

> Friendly ping...

Ooops. I'll pick it up first thing tomorrow morning.

Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap

2020-02-19 Thread Thomas Gleixner
Jürgen Groß  writes:
> On 18.02.20 22:03, Thomas Gleixner wrote:
>> BTW, why isn't stuff like this caught during next or at least
>> before the final release? Is nothing running CI on upstream with all
>> that XEN muck active?
>
> This problem showed up by not being able to start the X server (probably
> not the freshest one) in dom0 on a moderate aged AMD system.
>
> Our CI tests tend do be more text console based for dom0.

tools/testing/selftests/x86/io[perm|pl] should have caught that as well,
right? If not, we need to fix the selftests.
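
For reference, the core of such a test is tiny (sketch, needs
root/CAP_SYS_RAWIO on x86; the real tests live in
tools/testing/selftests/x86/):

  #include <err.h>
  #include <stdio.h>
  #include <sys/io.h>

  int main(void)
  {
          if (ioperm(0x80, 1, 1))         /* enable port 0x80 in the bitmap */
                  err(1, "ioperm(on)");

          outb(0, 0x80);                  /* must not #GP with the bit set */

          if (ioperm(0x80, 1, 0))         /* and drop it again */
                  err(1, "ioperm(off)");

          printf("I/O bitmap round trip OK\n");
          return 0;
  }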

Thanks,

tglx

Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap

2020-02-18 Thread Thomas Gleixner
Juergen Gross  writes:
> Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control
> ioperm() as well") reworked the iopl syscall to use I/O bitmaps.
>
> Unfortunately this broke Xen PV domains using that syscall as there
> is currently no I/O bitmap support in PV domains.
>
> Add I/O bitmap support via a new paravirt function update_io_bitmap
> which Xen PV domains can use to update their I/O bitmaps via a
> hypercall.
>
> Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() as 
> well")
> Reported-by: Jan Beulich 
> Cc:  # 5.5
> Signed-off-by: Juergen Gross 
> Reviewed-by: Jan Beulich 
> Tested-by: Jan Beulich 

Duh, sorry about that and thanks for fixing it.

BTW, why isn't stuff like this caught during next or at least
before the final release? Is nothing running CI on upstream with all
that XEN muck active?

Thanks,

tglx


Re: [PATCH 0/2] i8253: Fix PIT shutdown quirk on Hyper-V

2018-11-03 Thread Thomas Gleixner
On Fri, 2 Nov 2018, Juergen Gross wrote:
> On 01/11/2018 18:30, Michael Kelley wrote:
> > pit_shutdown() doesn't work on Hyper-V because of a quirk in the
> > PIT emulation. This problem exists in all versions of Hyper-V and
> > had not been noticed previously. When the counter register is set
> > to zero, the emulated PIT continues to interrupt @18.2 HZ.
> > 
> > So add a test for running on Hyper-V, and use that test to skip
> > setting the counter register when running on Hyper-V.
> > 
> > This patch replaces a previously proposed patch with a different
> > approach. This new approach follows comments from Thomas Gleixner.
> 
> Did you consider using a static_key instead? You could set it in
> ms_hyperv_init_platform(). This would enable you to support future
> Hyper-V versions which don't require avoiding to set the count to zero.

Duh. Now that you say it, it's more than obvious. Instead of checking for
running on Hyper-V, have a quirk check in that function and set it from
the Hyper-V init code. A static key is not necessarily required as this is
not a fast path, so a simple __ro_after_init marked variable is good enough.
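
I.e. something along these lines (sketch only, all names made up; the real
PIT programming lives in drivers/clocksource/i8253.c):

  static bool pit_skip_counter_clear __ro_after_init;

  /* Called from the Hyper-V platform setup (hypothetical hook) */
  void __init pit_set_shutdown_quirk(void)
  {
          pit_skip_counter_clear = true;
  }

  static int pit_shutdown(struct clock_event_device *evt)
  {
          raw_spin_lock(&i8253_lock);

          outb_p(0x30, PIT_MODE);         /* mode 0, gate off */
          if (!pit_skip_counter_clear) {
                  /*
                   * Zeroing the counter keeps Hyper-V's emulated PIT
                   * interrupting at 18.2 HZ, so skip it there.
                   */
                  outb_p(0, PIT_CH0);
                  outb_p(0, PIT_CH0);
          }

          raw_spin_unlock(&i8253_lock);
          return 0;
  }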

Michael, sorry for guiding you down the wrong road. Juergens idea is much
better.

If you redo that, could you please make sure, that your mail series is
properly threaded? i.e. the 1..n/n mails contain a

 Reference: 

Thanks,

tglx




Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Thomas Gleixner
On Wed, 3 Oct 2018, Andy Lutomirski wrote:
> > On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov  wrote:
> > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
> > not mistaken, you need to enable nesting for the VM to get the feature -
> > and most VMs don't have this) so I think we'll have to keep Hyper-V
> > vclock for the time being.
> > 
> But this does suggest that the correct way to pass a clock through to an
> L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use
> kvmclock (or something newer and better).  This would require adding
> support for atomic frequency changes all the way through the timekeeping
> and arch code.
>
> John, tglx, would that be okay or crazy?

Not sure what you mean. I think I lost you somewhere on the way.

Thanks,

tglx

Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-27 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> > smaller than cycle_last. The TSC sync stuff does not catch the small delta
> > for unknown raisins. I'll go and find that machine and test that again.
> 
> Of course it does not trigger anymore. We accumulated code between the
> point in timekeeping_advance() where the TSC is read and the update of the
> VDSO data.
> 
> I'll might have to get an 2.6ish kernel booted on that machine and try with
> that again. /me shudders

Actually it does happen, because the TSC is very slowly drifting apart due
to SMI wreckage trying to hide itself. It just takes a very long time.

Thanks,

tglx




Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-27 Thread Thomas Gleixner
On Wed, 19 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > > On Sep 18, 2018, at 3:46 PM, Thomas Gleixner  wrote:
> > > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > >> Do we do better if we use signed arithmetic for the whole calculation?
> > >> Then a small backwards movement would result in a small backwards result.
> > >> Or we could offset everything so that we’d have to go back several
> > >> hundred ms before we cross zero.
> > > 
> > > That would be probably the better solution as signed math would be
> > > problematic when the resulting ns value becomes negative. As the delta is
> > > really small, otherwise the TSC sync check would have caught it, the 
> > > caller
> > > should never be able to observe time going backwards.
> > > 
> > > I'll have a look into that. It needs some thought vs. the fractional part
> > > of the base time, but it should be not rocket science to get that
> > > correct. Famous last words...
> > > 
> > 
> > It’s also fiddly to tune. If you offset it too much, then the fancy
> > divide-by-repeated-subtraction loop will hurt more than the comparison to
> > last.
> 
> Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
> won't hurt the magic loop, but it will definitely cover that slight offset
> case.

I got it working, but first of all the gain is close to 0.

There is this other subtle issue: we've seen TSCs slowly drifting apart,
which is caught by the TSC watchdog eventually, but if the drift exceeds
the offset _before_ the watchdog triggers, we're back to square one.

So I rather stay on the safe side and just accept that we have to deal with
that. Sigh.

Thanks,

tglx

Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-19 Thread Thomas Gleixner
On Wed, 19 Sep 2018, Rasmus Villemoes wrote:
> On 2018-09-19 00:46, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> >>>
> >>
> >> Do we do better if we use signed arithmetic for the whole calculation?
> >> Then a small backwards movement would result in a small backwards result.
> >> Or we could offset everything so that we’d have to go back several
> >> hundred ms before we cross zero.
> > 
> > That would be probably the better solution as signed math would be
> > problematic when the resulting ns value becomes negative. As the delta is
> > really small, otherwise the TSC sync check would have caught it, the caller
> > should never be able to observe time going backwards.
> > 
> > I'll have a look into that. It needs some thought vs. the fractional part
> > of the base time, but it should be not rocket science to get that
> > correct. Famous last words...
> 
> Does the sentinel need to be U64_MAX? What if vgetcyc and its minions
> returned gtod->cycle_last-1 (for some value of 1), and the caller just
> does "if ((s64)cycles - (s64)last < 0) return fallback; ns +=
> (cycles-last)* ...". That should just be a "sub ; js ; ". It's an extra
> load of ->cycle_last, but only on the path where we're heading for the
> fallback anyway. The value of 1 can be adjusted so that in the "js"
> path, we could detect and accept an rdtsc_ordered() call that's just a
> few 10s of cycles behind last and treat that as 0 and continue back on
> the normal path. But maybe it's hard to get gcc to generate the expected
> code.

I played around with a lot of variants and GCC generates all kinds of
interesting ASM. And at some point optimizing that math code is not buying
anything because the LFENCE before RDTSC is dominating all of it.

Thanks,

tglx

Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > On Sep 18, 2018, at 3:46 PM, Thomas Gleixner  wrote:
> > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> >> Do we do better if we use signed arithmetic for the whole calculation?
> >> Then a small backwards movement would result in a small backwards result.
> >> Or we could offset everything so that we’d have to go back several
> >> hundred ms before we cross zero.
> > 
> > That would be probably the better solution as signed math would be
> > problematic when the resulting ns value becomes negative. As the delta is
> > really small, otherwise the TSC sync check would have caught it, the caller
> > should never be able to observe time going backwards.
> > 
> > I'll have a look into that. It needs some thought vs. the fractional part
> > of the base time, but it should be not rocket science to get that
> > correct. Famous last words...
> > 
> 
> It’s also fiddly to tune. If you offset it too much, then the fancy
> divide-by-repeated-subtraction loop will hurt more than the comparison to
> last.

Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
won't hurt the magic loop, but it will definitely cover that slight offset
case.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > On Sep 18, 2018, at 12:52 AM, Thomas Gleixner  wrote:
> > 
> >> On Mon, 17 Sep 2018, John Stultz wrote:
> >>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
> >>> Also, I'm not entirely convinced that this "last" thing is needed at
> >>> all.  John, what's the scenario under which we need it?
> >> 
> >> So my memory is probably a bit foggy, but I recall that as we
> >> accelerated gettimeofday, we found that even on systems that claimed
> >> to have synced TSCs, they were actually just slightly out of sync.
> >> Enough that right after cycles_last had been updated, a read on
> >> another cpu could come in just behind cycles_last, resulting in a
> >> negative interval causing lots of havoc.
> >> 
> >> So the sanity check is needed to avoid that case.
> > 
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
> > 
> > @Andy: Welcome to the wonderful world of TSC.
> > 
> 
> Do we do better if we use signed arithmetic for the whole calculation?
> Then a small backwards movement would result in a small backwards result.
> Or we could offset everything so that we’d have to go back several
> hundred ms before we cross zero.

That would be probably the better solution as signed math would be
problematic when the resulting ns value becomes negative. As the delta is
really small, otherwise the TSC sync check would have caught it, the caller
should never be able to observe time going backwards.

I'll have a look into that. It needs some thought vs. the fractional part
of the base time, but it should be not rocket science to get that
correct. Famous last words...

Thanks,

tglx



Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> smaller than cycle_last. The TSC sync stuff does not catch the small delta
> for unknown raisins. I'll go and find that machine and test that again.

Of course it does not trigger anymore. We accumulated code between the
point in timekeeping_advance() where the TSC is read and the update of the
VDSO data.

I'll might have to get an 2.6ish kernel booted on that machine and try with
that again. /me shudders

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> > I still have one of the machines which is affected by this.
> 
> Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
> current code:

The load order of last vs. rdtsc does not matter at all.

CPU0CPU1


now0 = rdtsc_ordered();
...
tk->cycle_last = now0;

gtod->seq++;
gtod->cycle_last = tk->cycle_last;
...
gtod->seq++;
seq_begin(gtod->seq);
now1 = rdtsc_ordered();

So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
smaller than cycle_last. The TSC sync stuff does not catch the small delta
for unknown raisins. I'll go and find that machine and test that again.
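
Hence the read side needs the clamp (sketch of the hot path; field names
simplified):

  u64 cycles = rdtsc_ordered();
  u64 last = gtod->cycle_last;
  u64 ns = gtod->base_ns;

  /*
   * A slightly lagging TSC makes cycles < last right after an update.
   * Without the check the u64 delta underflows and the result jumps
   * far into the future instead of a few ns back.
   */
  if (cycles > last)
          ns += ((cycles - last) * gtod->mult) >> gtod->shift;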

Thanks,

tglx





Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > > Your memory serves you right. That's indeed observable on CPUs which
> > > > lack TSC_ADJUST.
> > > 
> > > But, if the gtod code can observe this, then why doesn't the code that
> > > checks the sync?
> > 
> > Because it depends where the involved CPUs are in the topology. The sync
> > code might just run on the same package and simply not see it. Yes, w/o
> > TSC_ADJUST the TSC sync code can just fail to see the havoc.
> 
> Even with TSC adjust the TSC can be slightly off by design on multi-socket
> systems.

Here are the gory details:

   
https://lore.kernel.org/lkml/3c1737210708230408i7a8049a9m5db49e6c4d89a...@mail.gmail.com/

The changelog has an explanation as well.

d8bb6f4c1670 ("x86: tsc prevent time going backwards")

I still have one of the machines which is affected by this.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > Your memory serves you right. That's indeed observable on CPUs which
> > > lack TSC_ADJUST.
> > 
> > But, if the gtod code can observe this, then why doesn't the code that
> > checks the sync?
> 
> Because it depends where the involved CPUs are in the topology. The sync
> code might just run on the same package and simply not see it. Yes, w/o
> TSC_ADJUST the TSC sync code can just fail to see the havoc.

Even with TSC adjust the TSC can be slightly off by design on multi-socket
systems.

Thanks,

tglx


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Sep 2018, John Stultz wrote:
> > > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
> > > > Also, I'm not entirely convinced that this "last" thing is needed at
> > > > all.  John, what's the scenario under which we need it?
> > > 
> > > So my memory is probably a bit foggy, but I recall that as we
> > > accelerated gettimeofday, we found that even on systems that claimed
> > > to have synced TSCs, they were actually just slightly out of sync.
> > > Enough that right after cycles_last had been updated, a read on
> > > another cpu could come in just behind cycles_last, resulting in a
> > > negative interval causing lots of havoc.
> > > 
> > > So the sanity check is needed to avoid that case.
> > 
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
> 
> But, if the gtod code can observe this, then why doesn't the code that
> checks the sync?

Because it depends where the involved CPUs are in the topology. The sync
code might just run on the same package and simply not see it. Yes, w/o
TSC_ADJUST the TSC sync code can just fail to see the havoc.

Thanks,

tglx






Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Thomas Gleixner
On Mon, 17 Sep 2018, John Stultz wrote:
> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
> > Also, I'm not entirely convinced that this "last" thing is needed at
> > all.  John, what's the scenario under which we need it?
> 
> So my memory is probably a bit foggy, but I recall that as we
> accelerated gettimeofday, we found that even on systems that claimed
> to have synced TSCs, they were actually just slightly out of sync.
> Enough that right after cycles_last had been updated, a read on
> another cpu could come in just behind cycles_last, resulting in a
> negative interval causing lots of havoc.
> 
> So the sanity check is needed to avoid that case.

Your memory serves you right. That's indeed observable on CPUs which
lack TSC_ADJUST.

@Andy: Welcome to the wonderful world of TSC.

Thanks,

tglx



[patch V2 11/11] x86/vdso: Add CLOCK_TAI support

2018-09-17 Thread Thomas Gleixner
With the storage array in place it's now trivial to support CLOCK_TAI in
the vdso. Extend the base time storage array and add the update code.

Signed-off-by: Thomas Gleixner 
---

V2: Remove the masking trick

 arch/x86/entry/vsyscall/vsyscall_gtod.c |4 
 arch/x86/include/asm/vgtod.h|4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
base->sec = tk->xtime_sec;
base->nsec = tk->tkr_mono.xtime_nsec;
 
+   base = &vdata->basetime[CLOCK_TAI];
+   base->sec = tk->xtime_sec + (s64)tk->tai_offset;
+   base->nsec = tk->tkr_mono.xtime_nsec;
+
base = &vdata->basetime[CLOCK_MONOTONIC];
base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
nsec = tk->tkr_mono.xtime_nsec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -18,8 +18,8 @@ struct vgtod_ts {
u64 nsec;
 };
 
-#define VGTOD_BASES	(CLOCK_MONOTONIC_COARSE + 1)
-#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_BASES	(CLOCK_TAI + 1)
+#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
 #define VGTOD_COARSE	(BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
 
 /*
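
For a quick user-space smoke test of the new path (not part of the patch;
the CLOCK_TAI fallback define is only for older libc headers, 11 being the
Linux UAPI clock id):

#include <stdio.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux UAPI clock id */
#endif

/*
 * With the patch applied, CLOCK_TAI is served from the vDSO like
 * CLOCK_REALTIME/CLOCK_MONOTONIC, so running this under strace should
 * show no clock_gettime() syscall.
 */
int main(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_TAI, &ts)) {
		perror("clock_gettime(CLOCK_TAI)");
		return 1;
	}
	printf("TAI: %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
	return 0;
}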




[patch V2 10/11] x86/vdso: Move cycle_last handling into the caller

2018-09-17 Thread Thomas Gleixner
Dereferencing gtod->cycle_last all over the place and doing the cycles <
last comparison in the vclock read functions generates horrible code. Doing
it at the call site is much better and gains a few cycles both for TSC and
pvclock.

Caveat: This adds the comparison to the hyperv vclock as well, but I have
no way to test that.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   39 ++-
 1 file changed, 7 insertions(+), 32 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -76,9 +76,8 @@ static notrace const struct pvclock_vsys
 static notrace u64 vread_pvclock(void)
 {
const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
-   u64 ret;
-   u64 last;
u32 version;
+   u64 ret;
 
/*
 * Note: The kernel and hypervisor must guarantee that cpu ID
@@ -111,13 +110,7 @@ static notrace u64 vread_pvclock(void)
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
 
-   /* refer to vread_tsc() comment for rationale */
-   last = gtod->cycle_last;
-
-   if (likely(ret >= last))
-   return ret;
-
-   return last;
+   return ret;
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
@@ -130,30 +123,10 @@ static notrace u64 vread_hvclock(void)
 }
 #endif
 
-notrace static u64 vread_tsc(void)
-{
-   u64 ret = (u64)rdtsc_ordered();
-   u64 last = gtod->cycle_last;
-
-   if (likely(ret >= last))
-   return ret;
-
-   /*
-* GCC likes to generate cmov here, but this branch is extremely
-* predictable (it's just a function of time and the likely is
-* very likely) and there's a data dependence, so force GCC
-* to generate a branch instead.  I don't barrier() because
-* we don't actually need a barrier, and if this function
-* ever gets inlined it will generate worse code.
-*/
-   asm volatile ("");
-   return last;
-}
-
 notrace static inline u64 vgetcyc(int mode)
 {
if (mode == VCLOCK_TSC)
-   return vread_tsc();
+   return (u64)rdtsc_ordered();
 #ifdef CONFIG_PARAVIRT_CLOCK
else if (mode == VCLOCK_PVCLOCK)
return vread_pvclock();
@@ -168,17 +141,19 @@ notrace static inline u64 vgetcyc(int mo
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
	struct vgtod_ts *base = &gtod->basetime[clk];
+   u64 cycles, last, ns;
unsigned int seq;
-   u64 cycles, ns;
 
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = base->sec;
ns = base->nsec;
+   last = gtod->cycle_last;
cycles = vgetcyc(gtod->vclock_mode);
if (unlikely((s64)cycles < 0))
return vdso_fallback_gettime(clk, ts);
-   ns += (cycles - gtod->cycle_last) * gtod->mult;
+   if (cycles > last)
+   ns += (cycles - last) * gtod->mult;
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
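
As an aside, do_hres() relies on the gtod seqcount for a consistent
snapshot of sec/nsec/cycle_last. A rough user-space approximation of that
reader protocol with C11 atomics (the kernel's gtod_read_begin() and
gtod_read_retry() differ in detail):

#include <stdatomic.h>
#include <stdint.h>

/* Writer makes seq odd while updating, even when done; readers retry. */
struct gtod_like {
	atomic_uint seq;
	uint64_t sec, nsec, cycle_last;
};

static unsigned int read_begin(const struct gtod_like *g)
{
	unsigned int s;

	/* Spin while an update is in progress (odd sequence count). */
	while ((s = atomic_load_explicit(&g->seq, memory_order_acquire)) & 1)
		;
	return s;
}

static int read_retry(const struct gtod_like *g, unsigned int s)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&g->seq, memory_order_relaxed) != s;
}

int main(void)
{
	struct gtod_like g = { .sec = 1 };
	unsigned int seq;
	uint64_t sec;

	do {
		seq = read_begin(&g);
		sec = g.sec;
	} while (read_retry(&g, seq));

	return sec == 1 ? 0 : 1;
}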
 




[patch V2 08/11] x86/vdso: Replace the clockid switch case

2018-09-17 Thread Thomas Gleixner
Now that the time getter functions use the clockid as index into the
storage array for the base time access, the switch case can be replaced.

- Check for clockid >= MAX_CLOCKS and for negative clockid (CPU/FD) first
  and call the fallback function right away.

- After establishing that clockid is < MAX_CLOCKS, convert the clockid to a
  bitmask.

- Check for the supported high resolution and coarse functions by ANDing
  the bitmask of supported clocks and checking whether a bit is set.

This completely avoids jump tables, reduces the number of conditionals and
makes the VDSO extensible for other clock ids.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   38 ---
 1 file changed, 18 insertions(+), 20 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -239,29 +239,27 @@ notrace static void do_coarse(clockid_t
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
-   switch (clock) {
-   case CLOCK_REALTIME:
-   if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
-   goto fallback;
-   break;
-   case CLOCK_MONOTONIC:
-   if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
-   goto fallback;
-   break;
-   case CLOCK_REALTIME_COARSE:
-   do_coarse(CLOCK_REALTIME_COARSE, ts);
-   break;
-   case CLOCK_MONOTONIC_COARSE:
-   do_coarse(CLOCK_MONOTONIC_COARSE, ts);
-   break;
-   default:
-   goto fallback;
-   }
+   unsigned int msk;
+
+   /* Sort out negative (CPU/FD) and invalid clocks */
+   if (unlikely((unsigned int) clock >= MAX_CLOCKS))
+   return vdso_fallback_gettime(clock, ts);
 
-   return 0;
-fallback:
+   /*
+* Convert the clockid to a bitmask and use it to check which
+* clocks are handled in the VDSO directly.
+*/
+   msk = 1U << clock;
+   if (likely(msk & VGTOD_HRES)) {
+   if (do_hres(clock, ts) != VCLOCK_NONE)
+   return 0;
+   } else if (msk & VGTOD_COARSE) {
+   do_coarse(clock, ts);
+   return 0;
+   }
return vdso_fallback_gettime(clock, ts);
 }
+
 int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
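
The dispatch scheme in isolation, as a standalone demo (hypothetical clock
ids standing in for the real <uapi/linux/time.h> values):

#include <stdio.h>

enum { CLK_REALTIME, CLK_MONOTONIC, CLK_REALTIME_COARSE,
       CLK_MONOTONIC_COARSE, CLK_MAX };

#define HRES_MASK	((1U << CLK_REALTIME) | (1U << CLK_MONOTONIC))
#define COARSE_MASK	((1U << CLK_REALTIME_COARSE) | (1U << CLK_MONOTONIC_COARSE))

/* One range check plus one AND replaces the whole switch case. */
static const char *dispatch(unsigned int clock)
{
	unsigned int msk;

	if (clock >= CLK_MAX)	/* also catches negative ids cast to unsigned */
		return "fallback";
	msk = 1U << clock;
	if (msk & HRES_MASK)
		return "do_hres";
	if (msk & COARSE_MASK)
		return "do_coarse";
	return "fallback";
}

int main(void)
{
	for (unsigned int c = 0; c <= CLK_MAX; c++)
		printf("clock %u -> %s\n", c, dispatch(c));
	return 0;
}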
 




[patch V2 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-17 Thread Thomas Gleixner
The code flow for the vclocks is convoluted: vclocks which can be
invalidated separately from the vsyscall_gtod_data sequence have to store
that fact in a separate variable. That's inefficient.

Restructure the code so the vclock readout returns cycles and the
conversion to nanoseconds is handled at the call site.

If the clock gets invalidated or vclock is already VCLOCK_NONE, return
U64_MAX as the cycle value, which is invalid for all clocks and leave the
sequence loop immediately in that case by calling the fallback function
directly.

This allows removing the gettimeofday fallback as it now uses the
clock_gettime() fallback and does the nanoseconds to microseconds
conversion in the same way as it does when the vclock is functional. It
does not make a difference whether the division by 1000 happens in the
kernel fallback or in userspace.

Generates way better code and gains a few cycles back.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   81 +--
 1 file changed, 21 insertions(+), 60 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -48,16 +48,6 @@ notrace static long vdso_fallback_gettim
return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm("syscall" : "=a" (ret) :
-   "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
-   return ret;
-}
-
-
 #else
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -75,21 +65,6 @@ notrace static long vdso_fallback_gettim
return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-   : "memory", "edx");
-   return ret;
-}
-
 #endif
 
 #ifdef CONFIG_PARAVIRT_CLOCK
@@ -98,7 +73,7 @@ static notrace const struct pvclock_vsys
return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }
 
-static notrace u64 vread_pvclock(int *mode)
+static notrace u64 vread_pvclock(void)
 {
const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
u64 ret;
@@ -130,10 +105,8 @@ static notrace u64 vread_pvclock(int *mo
do {
version = pvclock_read_begin(pvti);
 
-   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
-   *mode = VCLOCK_NONE;
-   return 0;
-   }
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
+   return U64_MAX;
 
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
@@ -148,17 +121,12 @@ static notrace u64 vread_pvclock(int *mo
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-static notrace u64 vread_hvclock(int *mode)
+static notrace u64 vread_hvclock(void)
 {
const struct ms_hyperv_tsc_page *tsc_pg =
(const struct ms_hyperv_tsc_page *)&hvclock_page;
-   u64 current_tick = hv_read_tsc_page(tsc_pg);
-
-   if (current_tick != U64_MAX)
-   return current_tick;
 
-   *mode = VCLOCK_NONE;
-   return 0;
+   return hv_read_tsc_page(tsc_pg);
 }
 #endif
 
@@ -182,47 +150,42 @@ notrace static u64 vread_tsc(void)
return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcyc(int mode)
 {
-   u64 v;
-   cycles_t cycles;
-
-   if (gtod->vclock_mode == VCLOCK_TSC)
-   cycles = vread_tsc();
+   if (mode == VCLOCK_TSC)
+   return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-   else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-   cycles = vread_pvclock(mode);
+   else if (mode == VCLOCK_PVCLOCK)
+   return vread_pvclock();
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-   cycles = vread_hvclock(mode);
+   else if (mode == VCLOCK_HVCLOCK)
+   return vread_hvclock();
 #endif
-   else
-   return 0;
-   v = cycles - gtod->cycle_last;
-   return v * gtod->mult;
+   return U64_MAX;
 }
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
	struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
-   int mode;
-   u64 ns;
+   u64 cycles, ns;
 
do {
seq = gtod_read_begin(gtod);
-   mode = gtod->vcl
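
A side note on the U64_MAX convention: the caller's check "(s64)cycles < 0"
(see patch 10 above) works because U64_MAX reinterpreted as signed is -1,
while real TSC readings will not have bit 63 set for a very long time. A
tiny standalone check:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t invalid = UINT64_MAX;		/* marker for "vclock failed" */
	uint64_t plausible_tsc = 0x0123456789abULL;

	assert((int64_t)invalid < 0);		/* caught by (s64)cycles < 0 */
	assert((int64_t)plausible_tsc >= 0);	/* normal readings pass */
	return 0;
}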

[patch V2 06/11] x86/vdso: Collapse high resolution functions

2018-09-17 Thread Thomas Gleixner
do_realtime() and do_monotonic() are now the same except for the storage
array index. Hand the index in as an argument and collapse the functions.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c |   35 +++
 1 file changed, 7 insertions(+), 28 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -203,35 +203,12 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
 }
 
-/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
+   struct vgtod_ts *base = &gtod->basetime[clk];
unsigned int seq;
-   u64 ns;
int mode;
-
-   do {
-   seq = gtod_read_begin(gtod);
-   mode = gtod->vclock_mode;
-   ts->tv_sec = base->sec;
-   ns = base->nsec;
-   ns += vgetsns(&mode);
-   ns >>= gtod->shift;
-   } while (unlikely(gtod_read_retry(gtod, seq)));
-
-   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-   ts->tv_nsec = ns;
-
-   return mode;
-}
-
-notrace static int __always_inline do_monotonic(struct timespec *ts)
-{
-   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
-   unsigned int seq;
u64 ns;
-   int mode;
 
do {
seq = gtod_read_begin(gtod);
@@ -276,11 +253,11 @@ notrace int __vdso_clock_gettime(clockid
 {
switch (clock) {
case CLOCK_REALTIME:
-   if (do_realtime(ts) == VCLOCK_NONE)
+   if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_MONOTONIC:
-   if (do_monotonic(ts) == VCLOCK_NONE)
+   if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
@@ -303,7 +280,9 @@ int clock_gettime(clockid_t, struct time
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
if (likely(tv != NULL)) {
-   if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+   struct timespec *ts = (struct timespec *) tv;
+
+   if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
return vdso_fallback_gtod(tv, tz);
tv->tv_usec /= 1000;
}
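
The collapse works because the per-clock base times sit in an array indexed
by the clock id, which patch 05 below introduces. A rough standalone sketch
of that layout (hypothetical ids and values, not the kernel's types):

#include <stdio.h>
#include <stdint.h>

enum { CLK_REALTIME, CLK_MONOTONIC, CLK_MAX };

struct vgtod_ts_like {
	uint64_t sec;
	uint64_t nsec;	/* shifted nanoseconds in the real code */
};

static struct vgtod_ts_like basetime[CLK_MAX];

int main(void)
{
	basetime[CLK_REALTIME] = (struct vgtod_ts_like){ 1537200000, 0 };
	basetime[CLK_MONOTONIC] = (struct vgtod_ts_like){ 86400, 500 };

	/* The clock id is the array index: no switch, no jump table. */
	for (int clk = 0; clk < CLK_MAX; clk++)
		printf("clk %d: %llu sec, %llu nsec\n", clk,
		       (unsigned long long)basetime[clk].sec,
		       (unsigned long long)basetime[clk].nsec);
	return 0;
}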




[patch V2 05/11] x86/vdso: Introduce and use vgtod_ts

2018-09-17 Thread Thomas Gleixner
It's desired to support more clocks in the VDSO, e.g. CLOCK_TAI. This
results either in indirect calls due to the larger switch case, which then
requires retpolines, or, when the compiler is forced to avoid jump tables,
in even more conditionals.

To avoid both variants, which are bad for performance, the high resolution
functions and the coarse grained functions will be collapsed into one for
each. That requires storing the clock specific base time in an array.

Introduce struct vgtod_ts for storage and convert the data store, the
update function and the individual clock functions over to use it.

The new storage no longer uses gtod_long_t for seconds depending on 32 or
64 bit compile because this needs to be the full 64bit value even for
32bit when a Y2038 function is added. No point in keeping the distinction
alive in the internal representation.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/entry/vdso/vclock_gettime.c|   24 +--
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   51 
 arch/x86/include/asm/vgtod.h|   36 --
 3 files changed, 61 insertions(+), 50 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,6 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
unsigned int seq;
u64 ns;
int mode;
@@ -213,8 +214,8 @@ notrace static int __always_inline do_re
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
-   ts->tv_sec = gtod->wall_time_sec;
-   ns = gtod->wall_time_snsec;
+   ts->tv_sec = base->sec;
+   ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -227,6 +228,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
unsigned int seq;
u64 ns;
int mode;
@@ -234,8 +236,8 @@ notrace static int __always_inline do_mo
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
-   ts->tv_sec = gtod->monotonic_time_sec;
-   ns = gtod->monotonic_time_snsec;
+   ts->tv_sec = base->sec;
+   ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -248,21 +250,25 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
-   ts->tv_sec = gtod->wall_time_coarse_sec;
-   ts->tv_nsec = gtod->wall_time_coarse_nsec;
+   ts->tv_sec = base->sec;
+   ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
+   struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
-   ts->tv_sec = gtod->monotonic_time_coarse_sec;
-   ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+   ts->tv_sec = base->sec;
+   ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
@@ -318,7 +324,7 @@ int gettimeofday(struct timeval *, struc
 notrace time_t __vdso_time(time_t *t)
 {
/* This is atomic on x86 so we don't need any locks. */
-   time_t result = READ_ONCE(gtod->wall_time_sec);
+   time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec);
 
if (t)
*t = result;
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper *
 {
int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+   struct vgtod_ts *base;
+   u64 nsec;
 
/* Mark the new vclock used. */
BUILD_BUG_ON(VCLOCK_MAX >= 32);
@@ -45,34 +47,33 @@ void update_vsyscall(struct timekeeper *
vdata->mult = tk->tkr_mono.mult;
vdata->shift= tk->tkr_mono.shift;
 
-   vdata->wall_time_sec= tk->xtime_sec;
-  
