
Re: [PATCH v2 00/10] PCID and improved laziness

2017-06-18 Thread Andy Lutomirski

On Sun, Jun 18, 2017 at 2:29 PM, Levin, Alexander (Sasha Levin) wrote:
> On Tue, Jun 13, 2017 at 09:56:18PM -0700, Andy Lutomirski wrote:
>>There are three performance benefits here:
>>
>>1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
>>   This avoids many of them when switching tasks by using PCID.  In
>>   a stupid little benchmark I did, it saves about 100ns on my laptop
>>   per context switch.  I'll try to improve that benchmark.
>>
>>2. Mms that have been used recently on a given CPU might get to keep
>>   their TLB entries alive across process switches with this patch
>>   set.  TLB fills are pretty fast on modern CPUs, but they're even
>>   faster when they don't happen.
>>
>>3. Lazy TLB is way better.  We used to do two stupid things when we
>>   ran kernel threads: we'd send IPIs to flush user contexts on their
>>   CPUs and then we'd write to CR3 for no particular reason as an excuse
>>   to stop further IPIs.  With this patch, we do neither.
>>
>>This will, in general, perform suboptimally if paravirt TLB flushing
>>is in use (currently just Xen, I think, but Hyper-V is in the works).
>>The code is structured so we could fix it in one of two ways: we
>>could take a spinlock when touching the percpu state so we can update
>>it remotely after a paravirt flush, or we could be more careful about
>>exactly how we access the state and use cmpxchg16b to do atomic
>>remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
>>the optimization entirely.)
>
> Hey Andy,
>
> I've started seeing the following in -next:
>
> ------------[ cut here ]------------
> kernel BUG at arch/x86/mm/tlb.c:47!

...

> Call Trace:
>  flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
>  flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317
>  flush_tlb_page arch/x86/include/asm/tlbflush.h:253 [inline]

I think I see what's going on, and it should be fixed in the PCID
series.  I'll split out the fix.

Re: [PATCH v2 00/10] PCID and improved laziness

2017-06-18 Thread Levin, Alexander (Sasha Levin)

On Tue, Jun 13, 2017 at 09:56:18PM -0700, Andy Lutomirski wrote:
>There are three performance benefits here:
>
>1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
>   This avoids many of them when switching tasks by using PCID.  In
>   a stupid little benchmark I did, it saves about 100ns on my laptop
>   per context switch.  I'll try to improve that benchmark.
>
>2. Mms that have been used recently on a given CPU might get to keep
>   their TLB entries alive across process switches with this patch
>   set.  TLB fills are pretty fast on modern CPUs, but they're even
>   faster when they don't happen.
>
>3. Lazy TLB is way better.  We used to do two stupid things when we
>   ran kernel threads: we'd send IPIs to flush user contexts on their
>   CPUs and then we'd write to CR3 for no particular reason as an excuse
>   to stop further IPIs.  With this patch, we do neither.
>
>This will, in general, perform suboptimally if paravirt TLB flushing
>is in use (currently just Xen, I think, but Hyper-V is in the works).
>The code is structured so we could fix it in one of two ways: we
>could take a spinlock when touching the percpu state so we can update
>it remotely after a paravirt flush, or we could be more careful about
>exactly how we access the state and use cmpxchg16b to do atomic
>remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
>the optimization entirely.)

Hey Andy,

I've started seeing the following in -next:

------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:47!
invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 5302 Comm: kworker/u9:1 Not tainted 4.12.0-rc5+ #142
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-259:0)
task: 880030ad0040 task.stack: 880036e78000
RIP: 0010:leave_mm+0x33/0x40 arch/x86/mm/tlb.c:50
RSP: 0018:880036e7d4c8 EFLAGS: 00010246
RAX: 0001 RBX: 88006a65e240 RCX: dc00
RDX:  RSI: b1475fa0 RDI: 
RBP: 880036e7d638 R08: 110006dcfad1 R09: 880030ad0040
R10: 880036e7d3b8 R11:  R12: 110006dcfa9e
R13: 880036e7d6c0 R14: 880036e7d680 R15: 
FS:  () GS:88003ea0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00c420019318 CR3: 47a28000 CR4: 000406f0
Call Trace:
 flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
 flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317
 flush_tlb_page arch/x86/include/asm/tlbflush.h:253 [inline]
 ptep_clear_flush+0xd5/0x110 mm/pgtable-generic.c:86
 page_mkclean_one+0x242/0x540 mm/rmap.c:867
 rmap_walk_file+0x5e3/0xd20 mm/rmap.c:1681
 rmap_walk+0x1cd/0x2f0 mm/rmap.c:1699
 page_mkclean+0x2a0/0x380 mm/rmap.c:928
 clear_page_dirty_for_io+0x37e/0x9d0 mm/page-writeback.c:2703
 mpage_submit_page+0x77/0x230 fs/ext4/inode.c:2131
 mpage_process_page_bufs+0x427/0x500 fs/ext4/inode.c:2261
 mpage_prepare_extent_to_map+0x78d/0xcf0 fs/ext4/inode.c:2638
 ext4_writepages+0x13be/0x3dd0 fs/ext4/inode.c:2784
 do_writepages+0xff/0x170 mm/page-writeback.c:2357
 __writeback_single_inode+0x1d9/0x1480 fs/fs-writeback.c:1319
 writeback_sb_inodes+0x6e2/0x1260 fs/fs-writeback.c:1583
 wb_writeback+0x45d/0xed0 fs/fs-writeback.c:1759
 wb_do_writeback fs/fs-writeback.c:1891 [inline]
 wb_workfn+0x2b5/0x1460 fs/fs-writeback.c:1927
 process_one_work+0xbfa/0x1d30 kernel/workqueue.c:2097
 worker_thread+0x221/0x1860 kernel/workqueue.c:2231
 kthread+0x35f/0x430 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:425
Code: 48 3d 80 96 f8 b1 74 22 65 8b 05 f1 42 8c 53 83 f8 01 74 17 55 31 d2 48 
c7 c6 80 96 f8 b1 31 ff 48 89 e5 e8 60 ff ff ff 5d c3 c3 <0f> 0b 90 66 2e 0f 1f 
84 00 00 00 00 00 48 c7 c0 b4 10 73 b2 55 
RIP: leave_mm+0x33/0x40 arch/x86/mm/tlb.c:50 RSP: 880036e7d4c8
---[ end trace 3b5d5a6fb6e394f8 ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: 0x2b80 from 0x8100 (relocation range: 0x8000-0xbfff)
Rebooting in 86400 seconds..

Don't really have an easy way to reproduce it...

-- 

Thanks,
Sasha

Re: [PATCH v2 00/10] PCID and improved laziness

2017-06-14 Thread Andy Lutomirski

On Wed, Jun 14, 2017 at 3:18 PM, Dave Hansen wrote:
> On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
>> 2. Mms that have been used recently on a given CPU might get to keep
>>    their TLB entries alive across process switches with this patch
>>    set.  TLB fills are pretty fast on modern CPUs, but they're even
>>    faster when they don't happen.
>
> Let's not forget that TLBs are also getting bigger.  The bigger TLBs
> help ensure that they *can* survive across another process's timeslice.
>
> Also, the cost to refill the paging structure caches is going up.  Just
> think of how many cachelines you have to pull in to populate a
> ~1500-entry TLB, even if the CPU hid the latency of those loads.

Then throw EPT into the mix for extra fun.  I wonder if we should try
to allocate page tables from nearby physical addresses if we think we
might be running as a guest.

Re: [PATCH v2 00/10] PCID and improved laziness

2017-06-14 Thread Dave Hansen

On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
> 2. Mms that have been used recently on a given CPU might get to keep
>    their TLB entries alive across process switches with this patch
>    set.  TLB fills are pretty fast on modern CPUs, but they're even
>    faster when they don't happen.

Let's not forget that TLBs are also getting bigger.  The bigger TLBs
help ensure that they *can* survive across another process's timeslice.

Also, the cost to refill the paging structure caches is going up.  Just
think of how many cachelines you have to pull in to populate a
~1500-entry TLB, even if the CPU hid the latency of those loads.
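
To put a rough number on the cacheline point, here is a back-of-the-envelope
sketch. It assumes 8-byte page table entries and 64-byte cachelines (both
true on x86-64) and counts only leaf-level PTEs, ignoring the upper paging
levels. The function name is made up for illustration.

```c
#include <assert.h>

#define PTE_BYTES       8u   /* size of one x86-64 page table entry */
#define CACHELINE_BYTES 64u  /* assumed cacheline size */

/*
 * Hypothetical helper: fewest cachelines that can hold the leaf PTEs for
 * `tlb_entries` translations, i.e. the case where all PTEs are packed
 * contiguously.  The worst case is one cacheline per entry.
 */
static unsigned int min_pte_cachelines(unsigned int tlb_entries)
{
	unsigned int bytes = tlb_entries * PTE_BYTES;

	assert(tlb_entries > 0);
	/* Round up to whole cachelines. */
	return (bytes + CACHELINE_BYTES - 1) / CACHELINE_BYTES;
}
```

For a 1500-entry TLB this gives 188 cachelines in the fully packed case and
up to 1500 when every translation lives in a different cacheline.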

[PATCH v2 00/10] PCID and improved laziness

2017-06-13 Thread Andy Lutomirski

There are three performance benefits here:

1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
   This avoids many of them when switching tasks by using PCID.  In
   a stupid little benchmark I did, it saves about 100ns on my laptop
   per context switch.  I'll try to improve that benchmark.

2. Mms that have been used recently on a given CPU might get to keep
   their TLB entries alive across process switches with this patch
   set.  TLB fills are pretty fast on modern CPUs, but they're even
   faster when they don't happen.

3. Lazy TLB is way better.  We used to do two stupid things when we
   ran kernel threads: we'd send IPIs to flush user contexts on their
   CPUs and then we'd write to CR3 for no particular reason as an excuse
   to stop further IPIs.  With this patch, we do neither.
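
As an illustration of benefit 1, here is a minimal sketch of how a CR3 value
can carry a PCID.  It relies on the architectural layout when CR4.PCIDE=1:
bits 11:0 of CR3 hold the PCID, and setting bit 63 of the value written to
CR3 asks the CPU not to flush that PCID's TLB entries.  build_cr3() and the
macro names are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_MASK 0xFFFull      /* CR3 bits 11:0: PCID when CR4.PCIDE=1 */
#define CR3_NOFLUSH   (1ull << 63)  /* bit 63: keep this PCID's TLB entries */

/*
 * Hypothetical helper: compose the value to load into CR3 for a page table
 * root at physical address pgd_phys, tagged with the given PCID.
 */
static uint64_t build_cr3(uint64_t pgd_phys, uint16_t pcid, bool noflush)
{
	uint64_t cr3;

	assert((pgd_phys & CR3_PCID_MASK) == 0);  /* root must be page-aligned */
	assert(pcid <= CR3_PCID_MASK);            /* PCIDs are 12 bits wide */

	cr3 = pgd_phys | pcid;
	if (noflush)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}
```

With noflush set, switching back to an mm whose PCID is still live on this
CPU does not throw away its cached translations, which is where the saved
flush time would come from.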

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)
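
For readers who want a picture of the percpu state in question, here is a
simplified sketch.  It assumes the state is a pair of 64-bit values (an mm
identifier plus the flush generation the CPU has caught up to), which is
why a 16-byte atomic such as cmpxchg16b would be a natural fit for remote
updates.  The struct and function names are hypothetical, and the real
series may organize this differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: per-mm identity and flush generation. */
struct mm_sketch {
	uint64_t ctx_id;   /* unique per mm, never reused; 0 means "none" */
	uint64_t tlb_gen;  /* bumped whenever the mm's mappings need a flush */
};

/* Hypothetical: what one CPU remembers about the mm its TLB last held. */
struct cpu_tlb_sketch {
	uint64_t ctx_id;
	uint64_t tlb_gen;
};

/* The cached translations are usable only if both fields are current. */
static bool tlb_is_stale(const struct cpu_tlb_sketch *cpu,
			 const struct mm_sketch *mm)
{
	if (cpu->ctx_id != mm->ctx_id)
		return true;              /* TLB holds some other mm */
	return cpu->tlb_gen < mm->tlb_gen; /* missed at least one flush */
}

/* Record that this CPU has flushed up to the mm's current generation. */
static void tlb_mark_synced(struct cpu_tlb_sketch *cpu,
			    const struct mm_sketch *mm)
{
	assert(mm->ctx_id != 0);
	cpu->ctx_id = mm->ctx_id;
	cpu->tlb_gen = mm->tlb_gen;
}
```

After a paravirt flush, the flushing CPU would want to perform the
equivalent of tlb_mark_synced() on another CPU's state.  Both fields have
to change together, hence the choice between a spinlock and a 16-byte
compare-and-exchange.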

This is based on tip:x86/mm.  The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

Changes from RFC:
 - flush_tlb_func_common() no longer gets reentered (Nadav)
 - Fix ASID corruption on unlazying (kbuild bot)
 - Move Xen init to the right place
 - Misc cleanups

Andy Lutomirski (10):
  x86/ldt: Simplify LDT switching logic
  x86/mm: Remove reset_lazy_tlbstate()
  x86/mm: Give each mm TLB flush generation a unique ID
  x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  x86/mm: Rework lazy TLB mode and TLB freshness tracking
  x86/mm: Stop calling leave_mm() in idle code
  x86/mm: Disable PCID on 32-bit kernels
  x86/mm: Add nopcid to turn off PCID
  x86/mm: Enable CR4.PCIDE on supported systems
  x86/mm: Try to preserve old TLB entries using PCID

 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/ia64/include/asm/acpi.h                    |   2 -
 arch/x86/include/asm/acpi.h                     |   2 -
 arch/x86/include/asm/disabled-features.h        |   4 +-
 arch/x86/include/asm/mmu.h                      |  25 +-
 arch/x86/include/asm/mmu_context.h              |  40 ++-
 arch/x86/include/asm/processor-flags.h          |   2 +
 arch/x86/include/asm/tlbflush.h                 |  89 +-
 arch/x86/kernel/cpu/bugs.c                      |   8 +
 arch/x86/kernel/cpu/common.c                    |  33 +++
 arch/x86/kernel/smpboot.c                       |   1 -
 arch/x86/mm/init.c                              |   2 +-
 arch/x86/mm/tlb.c                               | 368 +++-
 arch/x86/xen/enlighten_pv.c                     |   6 +
 drivers/acpi/processor_idle.c                   |   2 -
 drivers/idle/intel_idle.c                       |   9 +-
 16 files changed, 429 insertions(+), 166 deletions(-)

-- 
2.9.4
