Re: [PATCH] futex: Handle transient "ownerless" rtmutex state correctly

2020-11-04 Thread Gratian Crisan


Thomas Gleixner writes:

> From: Mike Galbraith 
>
> Gratian managed to trigger the BUG_ON(!newowner) in fixup_pi_state_owner().
> This is one possible chain of events leading to this:
>
> Task Prio    Operation
> T1   120     lock(F)
> T2   120     lock(F)   -> blocks (top waiter)
> T3   50 (RT) lock(F)   -> boosts T1 and blocks (new top waiter)
> XX   timeout/signal    -> wakes T2
> T1   50      unlock(F) -> wakes T3 (rtmutex->owner == NULL,
>                           waiter bit is set)
> T2   120     cleanup   -> try_to_take_rt_mutex() fails because T3 is
>                           the top waiter and the lower priority T2
>                           cannot steal the lock.
>                        -> fixup_pi_state_owner() sees newowner == NULL
>                           -> BUG_ON()
>
> The comment states that this is invalid and rt_mutex_real_owner() must
> return a non NULL owner when the trylock failed, but in case of a queued
> and woken up waiter rt_mutex_real_owner() == NULL is a valid transient
> state. The higher priority waiter has simply not yet managed to take over
> the rtmutex.
>
> The BUG_ON() is therefore wrong and this is just another retry condition in
> fixup_pi_state_owner().
>
> Drop the locks, so that T3 can make progress, and then try the fixup again.
>
> Gratian provided a great analysis, traces and a reproducer. The analysis is
> to the point, but it confused the hell out of that tglx dude who had to
> page in all the futex horrors again. Condensed version is above. 
>
> [ tglx: Wrote comment and changelog ]
>
> Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex")
> Reported-by: Gratian Crisan 
> Signed-off-by: Mike Galbraith 
> Signed-off-by: Thomas Gleixner 
> Cc: sta...@vger.kernel.org
> Link: https://lore.kernel.org/r/87a6w6x7bb.fsf@ni.com

LGTM, no crashes in my testing today.

-Gratian

> ---
>  kernel/futex.c |   16 ++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2380,10 +2380,22 @@ static int fixup_pi_state_owner(u32 __us
>   }
>  
>   /*
> -  * Since we just failed the trylock; there must be an owner.
> +  * The trylock just failed, so either there is an owner or
> +  * there is a higher priority waiter than this one.
>*/
>   newowner = rt_mutex_owner(&pi_state->pi_mutex);
> - BUG_ON(!newowner);
> + /*
> +  * If the higher priority waiter has not yet taken over the
> +  * rtmutex then newowner is NULL. We can't return here with
> +  * that state because it's inconsistent vs. the user space
> +  * state. So drop the locks and try again. It's a valid
> +  * situation and not any different from the other retry
> +  * conditions.
> +  */
> + if (unlikely(!newowner)) {
> + ret = -EAGAIN;
> + goto handle_err;
> + }
>   } else {
>   WARN_ON_ONCE(argowner != current);
>   if (oldowner == current) {



Re: BUG_ON(!newowner) in fixup_pi_state_owner()

2020-11-03 Thread Gratian Crisan
Hi all,

I apologize for waking up the futex demons (and replying to my own
email), but ...

Gratian Crisan writes:
>
> Brandon and I have been debugging a nasty race that leads to
> BUG_ON(!newowner) in fixup_pi_state_owner() in futex.c. So far
> we've only been able to reproduce the issue on 4.9.y-rt kernels.
> We are still testing if this is a problem for later RT branches.

I was able to reproduce the BUG_ON(!newowner) in fixup_pi_state_owner()
with a 5.10.0-rc1-rt1 kernel (currently testing 5.10.0-rc2-rt4).

The kernel branch I had set up for this test is available here:
https://github.com/gratian/linux/commits/devel-rt/v5.10-rc1-rt1-ni-serial

It has a minimal set of commits on top of devel-rt/5.10.0-rc1-rt1 to
enable the UART on this hardware and add a kgdb breakpoint (so I can
extract more information before the BUG_ON).

I've captured the reproducer, boot messages, OOPS, full event ftrace
dump and some kgdb info here: https://github.com/gratian/traces

The abbreviated OOPS (w/o the ftrace dump since it's too large to
include inline; see link[1]):

[15812.238958] [ cut here ]
[15812.238963] kernel BUG at kernel/futex.c:2399!
[15812.238974] invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
[15812.261428] CPU: 1 PID: 1972 Comm: f_waiter Tainted: G U  W 
5.10.0-rc1-rt1-00015-g10c3af983f2e #1
[15812.271341] Hardware name: National Instruments NI cRIO-9035/NI cRIO-9035, 
BIOS 1.3.0f1 07/18/2016
[15812.280298] RIP: 0010:fixup_pi_state_owner.isra.0+0x2d7/0x380
[15812.286047] Code: 0f 85 a5 fe ff ff 48 8b 7d b8 e8 e4 fe 6b 00 85 c0 0f 85 
94 fe ff ff 4d 8b 75 28 49 83 e6 fe 0f 85 bd fd ff ff e8 f9 f2 02 00 <0f> 0b 48 
8b 45 a8 48 8b 38 e8 ab 1a 6c 00 48 8b 7d b8 e8 52 19 6c
[15812.304817] RSP: 0018:c90001877b88 EFLAGS: 00010046
[15812.310043] RAX:  RBX: 888006c52640 RCX: 0001
[15812.317177] RDX: 0078 RSI: 005c RDI: 888003e7f028
[15812.324310] RBP: c90001877c08 R08: 0001 R09: 888003e7f010
[15812.331443] R10: 01d8 R11:  R12: 56469b8770e4
[15812.338576] R13: 888003e7f000 R14:  R15: 88801f281708
[15812.345713] FS:  7f0d0f4e1700() GS:888037b0() 
knlGS:
[15812.353802] CS:  0010 DS:  ES:  CR0: 80050033
[15812.359547] CR2: 7f2f6e5830e1 CR3: 08a36000 CR4: 001006e0
[15812.366682] Call Trace:
[15812.369131]  fixup_owner+0x6a/0x70
[15812.372534]  futex_wait_requeue_pi.constprop.0+0x5a1/0x6b0
[15812.378024]  ? __hrtimer_init_sleeper+0x60/0x60
[15812.382562]  do_futex+0x515/0xc90
[15812.385879]  ? trace_buffer_unlock_commit_regs+0x40/0x1f0
[15812.391278]  ? trace_event_buffer_commit+0x7b/0x260
[15812.396158]  ? trace_event_raw_event_sys_enter+0x9a/0x100
[15812.401558]  ? _copy_from_user+0x2b/0x60
[15812.405484]  __x64_sys_futex+0x144/0x180
[15812.409410]  do_syscall_64+0x38/0x50
[15812.412986]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[15812.418037] RIP: 0033:0x7f0d0ec6b4c9
[15812.421613] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 9f 39 2b 00 f7 d8 64 89 01 48
[15812.440383] RSP: 002b:7f0d0f4e0ec8 EFLAGS: 0216 ORIG_RAX: 
00ca
[15812.447952] RAX: ffda RBX:  RCX: 7f0d0ec6b4c9
[15812.455085] RDX: 001a560b RSI: 008b RDI: 56469b8770e0
[15812.462219] RBP: 7f0d0f4e0f10 R08: 56469b8770e4 R09: 
[15812.469352] R10: 7f0d0f4e0f30 R11: 0216 R12: 
[15812.476485] R13: 7ffdd55cb86f R14:  R15: 7f0d0f56e040
[15812.483622] Modules linked in: g_ether u_ether libcomposite udc_core ipv6 
i915 mousedev hid_multitouch intel_gtt coretemp drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops aesni_intel glue_helper crypto_simd igb drm 
i2c_i801 lpc_ich mfd_core i2c_algo_bit i2c_smbus agpgart dwc3_pci video 
backlight button [last unloaded: rng_core]
[15812.513719] Dumping ftrace buffer:
[15812.517121] -

 see:
[1] 
https://github.com/gratian/traces/blob/main/oops_ftrace-5.10.0-rc1-rt1-00015-g10c3af983f2e-dump1.txt

[16492.155124] -
[16492.159486] ---[ end trace 0003 ]---
[16492.159489] printk: enabled sync mode
[16492.167762] RIP: 0010:fixup_pi_state_owner.isra.0+0x2d7/0x380
[16492.173513] Code: 0f 85 a5 fe ff ff 48 8b 7d b8 e8 e4 fe 6b 00 85 c0 0f 85 
94 fe ff ff 4d 8b 75 28 49 83 e6 fe 0f 85 bd fd ff ff e8 f9 f2 02 00 <0f> 0b 48 
8b 45 a8 48 8b 38 e8 ab 1a 6c 00 48 8b 7d b8 e8 52 19 6c
[16492.192282] RSP: 0018:c90001877b88 EFLAGS: 00010046
[16492.197508] RAX:  RBX: 888006c52640 RCX: 0001
[16492.204642] RDX: 0078 RSI: 005c RDI: 88800

BUG_ON(!newowner) in fixup_pi_state_owner()

2020-10-28 Thread Gratian Crisan
Hi all,

Brandon and I have been debugging a nasty race that leads to
BUG_ON(!newowner) in fixup_pi_state_owner() in futex.c. So far
we've only been able to reproduce the issue on 4.9.y-rt kernels.
We are still testing if this is a problem for later RT branches.

The original reproducing app was complex and took about a month
to trigger it. However based on event traces I was able to put
together the .c reproducer below. This triggers the BUG_ON on a
2-core Intel Atom E3825 system when using 4.9.y-rt within
hours (up to a day).

Here's an attempt to describe in a pseudo trace the sequence of
events that leads to the BUG_ON. All threads start on the same
CPU (CPU1 in the linked traces). The waiter, waker and boosted
threads start as SCHED_OTHER prio:120, rt_thread is configured as
SCHED_FIFO prio:92.

waiter_thread   waker_thread    boosted_thread  rt_thread


clock_nanosleep()
futex(FUTEX_WAIT_REQUEUE_PI,
  cond, lock1, timeout)
  futex_wait_requeue_pi()
futex_wait_queue_me()
  

futex(FUTEX_LOCK_PI, lock2)

clock_nanosleep()

futex(FUTEX_LOCK_PI, lock1)
futex(FUTEX_CMP_REQUEUE_PI,
  cond, lock1)
futex(FUTEX_UNLOCK_PI, lock1)
  
 
 
  
  
  
 futex(FUTEX_LOCK_PI, lock2)

futex(FUTEX_LOCK_PI, lock1)
   

  

  

  
rt_mutex_wait_proxy_lock()
  -ETIMEDOUT                      // boosted_thread gets delayed
fixup_owner()                     // expiring unrelated timers
  fixup_pi_state_owner()
    // acquires wait_lock
    __rt_mutex_futex_trylock()
      __rt_mutex_slowtrylock()
        try_to_take_rt_mutex()    // spins on wait_lock
          // fails to take the lock:
          // the lock has no owner, but it has higher
          // priority waiters enqueued (i.e. boosted_thread)
    BUG_ON(!newowner)

Full ftrace event dumps on OOPS are available at:
https://github.com/gratian/traces

In the same repo you'll also find info on the boot messages for
this hardware and a capture of a kgdb session with prints for the
data structures involved. 

Some important data points - at the time of the crash:

  * the q->pi_state->pi_mutex->owner is 0x1

  * the q->pi_state->pi_mutex has waiters enqueued and the
current 'waiter_thread' task is not the highest priority
waiter

  * the q->pi_state->owner is the 'waiter_thread'

You can find the kernel branch used in these tests here:
https://github.com/gratian/linux/commits/stable-rt/v4.9-rt-ni-serial

The other thing worth mentioning is the fact that the comment in
fixup_pi_state_owner() related to this:

/*
 * Since we just failed the trylock; there must be an owner.
 */
newowner = rt_mutex_owner(&pi_state->pi_mutex);
BUG_ON(!newowner);

seems incomplete: to my eye there is a code path in
try_to_take_rt_mutex() through which the trylock can fail
because the current task is not the highest priority waiter
while the rt mutex doesn't have the owner field set yet
(i.e. the highest priority waiter hasn't had a chance to
fix up the owner field yet).

I'll be happy to provide any additional info.
Thoughts and suggestions appreciated.

Thanks,
Gratian

--8<---------------cut here---------------start------------->8---
/* 
 * fbomb.c
 * Reproducer for BUG_ON(!newowner) in fixup_pi_state_owner()
 * /kernel/futex.c
 */
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TEST_CPU        1
#define NSEC_PER_USEC   1000ULL
#define NSEC_PER_SEC    1000000000ULL

#define FUTEX_TID_MASK  0x3fffffff

typedef uint32_t futex_t;

sta

Re: [patch 1/1] Kconfig: Introduce CONFIG_PREEMPT_RT

2019-07-16 Thread Gratian Crisan


Thomas Gleixner writes:

> Add a new entry to the preemption menu which enables the real-time support
> for the kernel. The choice is only enabled when an architecture supports
> it.
>
> It selects PREEMPT as the RT features depend on it. To achieve that the
> existing PREEMPT choice is renamed to PREEMPT_LL which selects PREEMPT as
> well.
>
> No functional change.
>
> Signed-off-by: Thomas Gleixner 

+1 from National Instruments. We have a vested interest in preempt_rt
and we're committed to helping support, maintain, and test it. Glad to
see this happening.

Acked-by: Gratian Crisan 

Thanks,
Gratian

> ---
>  arch/Kconfig   |3 +++
>  kernel/Kconfig.preempt |   25 +++--
>  2 files changed, 26 insertions(+), 2 deletions(-)
>
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -809,6 +809,9 @@ config ARCH_NO_COHERENT_DMA_MMAP
>  config ARCH_NO_PREEMPT
>   bool
>
> +config ARCH_SUPPORTS_RT
> + bool
> +
>  config CPU_NO_EFFICIENT_FFS
>   def_bool n
>
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -35,10 +35,10 @@ config PREEMPT_VOLUNTARY
>
> Select this if you are building a kernel for a desktop system.
>
> -config PREEMPT
> +config PREEMPT_LL
>   bool "Preemptible Kernel (Low-Latency Desktop)"
>   depends on !ARCH_NO_PREEMPT
> - select PREEMPT_COUNT
> + select PREEMPT
>   select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>   help
> This option reduces the latency of the kernel by making
> @@ -55,7 +55,28 @@ config PREEMPT
> embedded system with latency requirements in the milliseconds
> range.
>
> +config PREEMPT_RT
> + bool "Fully Preemptible Kernel (Real-Time)"
> + depends on EXPERT && ARCH_SUPPORTS_RT
> + select PREEMPT
> + help
> +   This option turns the kernel into a real-time kernel by replacing
> +   various locking primitives (spinlocks, rwlocks, etc) with
> +   preemptible priority-inheritance aware variants, enforcing
> +   interrupt threading and introducing mechanisms to break up long
> +   non-preemptible sections. This makes the kernel, except for very
> +   low level and critical code paths (entry code, scheduler, low
> +   level interrupt handling) fully preemptible and brings most
> +   execution contexts under scheduler control.
> +
> +   Select this if you are building a kernel for systems which
> +   require real-time guarantees.
> +
>  endchoice
>
>  config PREEMPT_COUNT
> bool
> +
> +config PREEMPT
> +   bool
> +   select PREEMPT_COUNT
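
For reference, on an architecture that sets ARCH_SUPPORTS_RT, choosing the new option would produce a configuration along these lines (illustrative fragment, following the patch's select chain: PREEMPT_RT selects PREEMPT, which selects PREEMPT_COUNT):

```
CONFIG_ARCH_SUPPORTS_RT=y
CONFIG_PREEMPT_RT=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
```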


Kernel page fault in vmalloc_fault() after a preempted ioremap

2018-03-08 Thread Gratian Crisan
Hi all,

We are seeing kernel page faults happening on module loads with certain
drivers like the i915 video driver[1]. This was initially discovered on
a 4.9 PREEMPT_RT kernel. It takes 5 days on average to reproduce using a
simple reboot loop test. Looking at the code paths involved I believe
the issue is still present in the latest vanilla kernel.

Some relevant points are:

  * x86_64 CPU: Intel Atom E3940

  * CONFIG_HUGETLBFS is not set (which also gates CONFIG_HUGETLB_PAGE)

Based on function traces I was able to gather the sequence of events is:

  1. Driver starts a ioremap operation for a region that is PMD_SIZE in
  size (or PUD_SIZE).

  2. The ioremap() operation is preempted while it's in the middle of
  setting up the page mappings:
  ioremap_page_range->...->ioremap_pmd_range->pmd_set_huge <>

  3. Unrelated tasks run. Traces also include some cross core scheduling
  IPI calls.

  4. Driver resumes execution, finishes the ioremap operation, and tries
  to access the newly mapped IO region. This triggers a vmalloc fault.

  5. The vmalloc_fault() function hits a kernel page fault when trying to
  dereference a non-existent *pte_ref.

The reason this happens is that the code paths called from
ioremap_page_range() and the code paths in vmalloc_fault() make different
assumptions about when a large page (pud/pmd) mapping can be used.

Using the PMD sized ioremap case as an example (the PUD case is similar):
ioremap_pmd_range() calls ioremap_pmd_enabled() which is gated by
CONFIG_HAVE_ARCH_HUGE_VMAP. On x86_64 this will return true unless the
"nohugeiomap" kernel boot parameter is passed in.

On the other hand, in the rare case when a page fault happens in the
ioremap'ed region, vmalloc_fault() calls the pmd_huge() function to check
if a PMD page is marked huge or if it should go on and get a reference to
the PTE. However pmd_huge() is conditionally compiled based on the user
configured CONFIG_HUGETLB_PAGE selected by CONFIG_HUGETLBFS. If the
CONFIG_HUGETLBFS option is not enabled pmd_huge() is always defined to be
0.

The end result is an OOPS in vmalloc_fault() when the non-existent pte_ref
is dereferenced because the test for pmd_huge() failed.

Commit f4eafd8bcd52 ("x86/mm: Fix vmalloc_fault() to handle large pages
properly") attempted to fix the mismatch between ioremap() and
vmalloc_fault() with regards to huge page handling but it missed this use
case.

I am working on a simpler reproducing case however so far I've been
unsuccessful in re-creating the conditions that trigger the vmalloc fault
in the first place. Adding explicit scheduling points in
ioremap_pmd_range/pmd_set_huge doesn't seem to be sufficient. Ideas
appreciated.

Any thoughts on what a correct fix would look like? Should the ioremap
code paths respect the HUGETLBFS config or would it be better for the
vmalloc fault code paths to match the tests used in ioremap and not rely
on the HUGETLBFS option being enabled?
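
One possible shape for the second option, as an untested kernel-style sketch (not a patch): on x86, pmd_large() tests _PAGE_PSE on the entry itself and, unlike pmd_huge(), is not gated by CONFIG_HUGETLBFS, so vmalloc_fault() could stop at the PMD level based on the entry rather than on the hugetlbfs helper:

```c
/* sketch only: vmalloc_fault() PMD-level walk, 4.9-era x86 shape */
pmd_k = pmd_offset(pud_k, address);
if (!pmd_present(*pmd_k))
	return -1;

if (pmd_large(*pmd_k))	/* was: pmd_huge() -- always 0 w/o HUGETLBFS */
	return 0;	/* huge mapping: there is no PTE level to walk */

pte_k = pte_offset_kernel(pmd_k, address);
if (!pte_present(*pte_k))
	return -1;	/* avoids dereferencing the non-existent pte */
```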

Thanks,
Gratian


[1]

[3.837847] BUG: unable to handle kernel paging request at 880093c0
[3.837855] IP: [] vmalloc_fault+0x1e5/0x21d
[3.837857] PGD 1f20067
[3.837857] PUD 0
[3.837858]
[3.837860] Oops: 0000 [#1] PREEMPT SMP
[3.837880] Modules linked in: i915(+) dwc3 udc_core nichenumk(PO) 
nifslk(PO) nimdbgk(PO) niorbk(PO) nipalk(PO) intel_gtt drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb atomicchinchk(PO) 
coretemp i2c_i801 nibds(PO) i2c_smbus nikal(PO) i2c_algo_bit dwc3_pci agpgart 
video tpm_tis backlight tpm_tis_core tpm button fuse
[3.837885] CPU: 1 PID: 238 Comm: udevd Tainted: P   O
4.9.33-rt23-5.6.0d60 #1
[3.837886] Hardware name: National Instruments NI cRIO-9042/NI cRIO-9042, 
BIOS 5.12 09/04/2017
[3.837887] task: 880179137080 task.stack: c90001154000
[3.837891] RIP: 0010:[]  [] 
vmalloc_fault+0x1e5/0x21d
[3.837892] RSP: 0018:c900011578e8  EFLAGS: 00010006
[3.837893] RAX: 3000 RBX: 880003c0 RCX: 800091f3
[3.837894] RDX: 800091f3 RSI: 8800 RDI: 3fe0
[3.837895] RBP: c900011578f8 R08: 9000 R09: 9000
[3.837896] R10: 0080 R11: 0080 R12: c90001278000
[3.837897] R13: c900011579a8 R14: 880179b3a400 R15: 0048
[3.837899] FS:  7f7b7942f880() GS:88017fc8() 
knlGS:
[3.837900] CS:  0010 DS:  ES:  CR0: 80050033
[3.837901] CR2: 880093c0 CR3: 000179207000 CR4: 003406e0
[3.837902] Stack:
[3.837906]    c90001157968 
810449f3
[3.837909]  c9000120 c9000140 8173 
0246
[3.837911]  880179b3a498 880179137080 c90001157958 

[3.837912] Call Trace:
[3.837919]  [] __do_page_fault+0x313/0x420
[3.837922]  [] do_page_fault+0x25/0x70
[3.837925]  [] ? iomem_map

Re: [PATCH] futex: Avoid violating the 10th rule of futex

2017-12-08 Thread Gratian Crisan

Peter Zijlstra writes:

> On Thu, Dec 07, 2017 at 05:02:40PM -0600, Gratian Crisan wrote:
>
>> Yep ... looks good to me. I've been running two targets with the
>> original reproducer for 8 hours now plus a target running the C test.
>> All of them are still going.
>> 
>> I'm going to let them run overnight to make sure but I'm going to call
>> it fixed.
>
> Assuming nothing bad happens; find below the patch with a Changelog
> attached.

Looks good after overnight testing. Thank you so much.

-Gratian


Re: PI futexes + lock stealing woes

2017-12-07 Thread Gratian Crisan

Julia Cartwright writes:

> On Thu, Dec 07, 2017 at 08:57:59AM -0600, Gratian Crisan wrote:
>>
>> Peter Zijlstra writes:
>> > The below compiles and boots, but is otherwise untested. Could you give
>> > it a spin?
>>
>> Thank you! Yes, I'll start a test now.
>
> I gave it a spin with my minimal reproducing case w/ the strategic
> msleep().  Previously, I could reproduce it in less than a minute in
> that setup; now it's been running for a couple hours.  So ... looks
> good? :)

Yep ... looks good to me. I've been running two targets with the
original reproducer for 8 hours now plus a target running the C test.
All of them are still going.

I'm going to let them run overnight to make sure but I'm going to call
it fixed.

-Gratian


Re: PI futexes + lock stealing woes

2017-12-07 Thread Gratian Crisan

Peter Zijlstra writes:
> The below compiles and boots, but is otherwise untested. Could you give
> it a spin?

Thank you! Yes, I'll start a test now.

-Gratian

> ---
>  kernel/futex.c  | 83 
> +
>  kernel/locking/rtmutex.c| 26 +
>  kernel/locking/rtmutex_common.h |  1 +
>  3 files changed, 87 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 76ed5921117a..29ac5b64e7c7 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2294,21 +2294,17 @@ static void unqueue_me_pi(struct futex_q *q)
>   spin_unlock(q->lock_ptr);
>  }
>  
> -/*
> - * Fixup the pi_state owner with the new owner.
> - *
> - * Must be called with hash bucket lock held and mm->sem held for non
> - * private futexes.
> - */
>  static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
> - struct task_struct *newowner)
> + struct task_struct *argowner)
>  {
> - u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
>   struct futex_pi_state *pi_state = q->pi_state;
>   u32 uval, uninitialized_var(curval), newval;
> - struct task_struct *oldowner;
> + struct task_struct *oldowner, *newowner;
> + u32 newtid;
>   int ret;
>  
> + lockdep_assert_held(q->lock_ptr);
> +
>   raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  
>   oldowner = pi_state->owner;
> @@ -2317,11 +2313,17 @@ static int fixup_pi_state_owner(u32 __user *uaddr, 
> struct futex_q *q,
>   newtid |= FUTEX_OWNER_DIED;
>  
>   /*
> -  * We are here either because we stole the rtmutex from the
> -  * previous highest priority waiter or we are the highest priority
> -  * waiter but have failed to get the rtmutex the first time.
> +  * We are here because either:
> +  *
> +  *  - we stole the lock and pi_state->owner needs updating to reflect
> +  *that (@argowner == current),
>*
> -  * We have to replace the newowner TID in the user space variable.
> +  * or:
> +  *
> +  *  - someone stole our lock and we need to fix things to point to the
> +  *new owner (@argowner == NULL).
> +  *
> +  * Either way, we have to replace the TID in the user space variable.
>* This must be atomic as we have to preserve the owner died bit here.
>*
>* Note: We write the user space value _before_ changing the pi_state
> @@ -2334,6 +2336,42 @@ static int fixup_pi_state_owner(u32 __user *uaddr, 
> struct futex_q *q,
>* in the PID check in lookup_pi_state.
>*/
>  retry:
> + if (!argowner) {
> + if (oldowner != current) {
> + /*
> +  * We raced against a concurrent self; things are
> +  * already fixed up. Nothing to do.
> +  */
> + ret = 0;
> + goto out_unlock;
> + }
> +
> + if (__rt_mutex_futex_trylock(&pi_state->pi_mutex)) {
> + /* We got the lock after all, nothing to fix. */
> + ret = 0;
> + goto out_unlock;
> + }
> +
> + /*
> +  * Since we just failed the trylock; there must be an owner.
> +  */
> + newowner = rt_mutex_owner(&pi_state->pi_mutex);
> + BUG_ON(!newowner);
> + } else {
> + WARN_ON_ONCE(argowner != current);
> + if (oldowner == current) {
> + /*
> +  * We raced against a concurrent self; things are
> +  * already fixed up. Nothing to do.
> +  */
> + ret = 0;
> + goto out_unlock;
> + }
> + newowner = argowner;
> + }
> +
> + newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
> +
>   if (get_futex_value_locked(&uval, uaddr))
>   goto handle_fault;
>  
> @@ -2434,15 +2472,28 @@ static int fixup_owner(u32 __user *uaddr, struct 
> futex_q *q, int locked)
>* Got the lock. We might not be the anticipated owner if we
>* did a lock-steal - fix up the PI-state in that case:
>*
> -  * We can safely read pi_state->owner without holding wait_lock
> -  * because we now own the rt_mutex, only the owner will attempt
> -  * to change it.
> +  * Speculative pi_state->owner read (we don't hold wait_lock);
> +  * since we own the lock pi_state->owner == current is the
> +  * stable state, anything else needs more attention.
>*/
>   if (q->pi_state->owner != current)
>   ret = fixup_pi_state_owner(uaddr, q, current);
>   goto out;
>   }
>  
> + /*
> +  * If we didn't get the lock; check if anybody s

Re: PI futexes + lock stealing woes

2017-12-06 Thread Gratian Crisan

Peter Zijlstra writes:

> On Wed, Nov 29, 2017 at 11:56:05AM -0600, Julia Cartwright wrote:
>
>> fixup_owner() used to have additional seemingly relevant checks in place
>> that were removed 73d786bd043eb ("futex: Rework inconsistent
>> rt_mutex/futex_q state").
>
> *groan*... yes. I completely missed that extra case also applied to
> requeue_pi (requeue always did hurt my brain).

FWIW I have been testing for about two days now with the fixup_owner()
hunk of 73d786bd043eb ("futex: Rework inconsistent rt_mutex/futex_q
state") reverted. So far it hasn't hit the race/deadlock. It normally
takes around 8 hours to reproduce.

I've also tried Julia's msleep() trick for expanding the race window for
the last 4 hours or so of testing and it seems to be still going.

Thanks,
Gratian


Re: [RT-SUMMIT] Prague Oct. 21st

2017-10-18 Thread Gratian Crisan
Hi Thomas,

Is there additional info on the address of the venue? The wiki links to
the Czech Technical University web page; however, from what I can tell
there are multiple faculty buildings in that area of Prague.

I apologize if I missed something obvious, I'm currently working with
crappy internet access.

Thanks,
Gratian

Thomas Gleixner writes:

> Folks,
>
> this is a reminder for the upcoming RT-Summit event in Prague on October
> 21st at the Czech Technical University.
>
> The schedule and logistics information are available from:
>
>
>   https://wiki.linuxfoundation.org/realtime/events/rt-summit2017/start
>
>
> If you have questions or need assistance please let us know.
>
> Looking forward to see you in Prague!
>
> Thanks,
>
>   Thomas



Re: [PATCH RT] time/hrtimer: Use softirq based wakeups for non-RT threads

2017-10-05 Thread Gratian Crisan

Sebastian Andrzej Siewior writes:

> Normal wake ups (like clock_nanosleep()) which are performed by normal
> users can easily lead to 2ms latency spikes if (enough) hrtimer wakeups
> are synchronized.
> This patch moves all hrtimers wakeups to the softirq queue unless the
> caller has a RT priority.
>
> Reported-by: Gratian Crisan 

I can confirm this patch fixes the original problem reported.

I ran an overnight test (about 30 hours total) on two platforms using
cyclictest + hrtimer stress load: configurable number of SCHED_OTHER
threads doing random clock_nanosleep() in up to 1 ms intervals.

Without this patch the max latency recorded was:
  * on dual core, Intel Atom E3825 @ 1.33GHz: 107 us
  * on dual core, Zynq 7020 @ 667MHz: 261 us

With this patch the max numbers drop to:
  * on dual core, Intel Atom E3825 @ 1.33GHz: 66 us
  * on dual core, Zynq 7020 @ 667MHz: 90 us

The max latency numbers with this patch are in-line with expectations
for these platforms.

Thank you so much,
-Gratian


latency induced by hrtimer stacking

2017-01-04 Thread Gratian Crisan
if (threads == NULL) {
fprintf(stderr, "Failed to allocate memory for %d threads\n",
g_nthreads);
return EXIT_FAILURE;
}

signal(SIGINT, sighandler);
signal(SIGHUP, SIG_IGN);

for (i = 0; i < g_nthreads; i++) {
if (pthread_create(&threads[i], NULL, test_thread, NULL) != 0) {
perror("Failed to create thread");
return EXIT_FAILURE;
}
}

while (!g_stop)
clock_nanosleep(CLOCK, 0, &t, NULL);

for (i = 0; i < g_nthreads; i++)
    pthread_join(threads[i], NULL);

return EXIT_SUCCESS;
}
---

[2]
---
From 6e6e7c852fc8efc95dbfa6becc14b3a7846fd6f8 Mon Sep 17 00:00:00 2001
From: Gratian Crisan 
Date: Wed, 9 Nov 2016 15:43:03 -0600
Subject: [HACK][RFC] sysfs, hrtimer: Add sysfs knob to defer non-rt timers to
 softirq context

Current hrtimer implementation processes all expired timers marked as
irqsafe in the timer irq context. This includes timers created by non-rt
tasks and as a result it can create unbounded latencies for real-time tasks
waiting on an hrtimer to expire.

Provide /sys/kernel/hrtimer_defer_nonrt configuration knob which when set
results in only marking as irqsafe timers created by a rt task. The rest
will get deferred to the HRTIMER_SOFTIRQ for handling via the existing
mechanism in __hrtimer_run_queues/hrtimer_rt_defer.

Signed-off-by: Gratian Crisan 
---
 kernel/ksysfs.c   | 18 ++
 kernel/time/hrtimer.c |  4 +++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index ddef079..9fd4f50 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -189,6 +189,23 @@ static ssize_t rcu_normal_store(struct kobject *kobj,
 KERNEL_ATTR_RW(rcu_normal);
 #endif /* #ifndef CONFIG_TINY_RCU */
 
+extern int hrtimer_defer_nonrt;
+static ssize_t hrtimer_defer_nonrt_show(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%d\n", hrtimer_defer_nonrt);
+}
+static ssize_t hrtimer_defer_nonrt_store(struct kobject *kobj,
+   struct kobj_attribute *attr,
+   const char *buf, size_t count)
+{
+   if (kstrtoint(buf, 0, &hrtimer_defer_nonrt))
+   return -EINVAL;
+
+   return count;
+}
+KERNEL_ATTR_RW(hrtimer_defer_nonrt);
+
 /*
  * Make /sys/kernel/notes give the raw contents of our kernel .notes section.
  */
@@ -237,6 +254,7 @@ static struct attribute * kernel_attrs[] = {
 #ifdef CONFIG_PREEMPT_RT_FULL
&realtime_attr.attr,
 #endif
+   &hrtimer_defer_nonrt_attr.attr,
NULL
 };
 
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index d85f638..df5783a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -106,6 +106,8 @@ static inline int hrtimer_clockid_to_base(clockid_t 
clock_id)
return hrtimer_clock_to_base_table[clock_id];
 }
 
+int hrtimer_defer_nonrt;
+
 /*
  * Functions and macros which are different for UP/SMP systems are kept in a
  * single place
@@ -1644,7 +1646,7 @@ static enum hrtimer_restart hrtimer_wakeup(struct hrtimer 
*timer)
 void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *task)
 {
sl->timer.function = hrtimer_wakeup;
-   sl->timer.irqsafe = 1;
+   sl->timer.irqsafe = (hrtimer_defer_nonrt) ? rt_task(task) : 1;
sl->task = task;
 }
 EXPORT_SYMBOL_GPL(hrtimer_init_sleeper);
-- 
2.10.2


Re: [PATCH RFC] clocksource: Detect a watchdog overflow

2016-04-07 Thread Gratian Crisan

John Stultz writes:

> On Tue, Mar 15, 2016 at 11:50 AM, Gratian Crisan  
> wrote:
>> The clocksource watchdog can falsely trigger and disable the main
>> clocksource when the watchdog wraps around.
>>
>> The reason is that an interrupt storm and/or high priority (FIFO/RR) tasks
>> can preempt the timer softirq long enough for the watchdog to wrap around
>> if it has a limited number of bits available by comparison to the main
>> clocksource. One observed example is on a Intel Baytrail platform where TSC
>> is the main clocksource, HPET is disabled due to a hardware bug and acpi_pm
>> gets selected as the watchdog clocksource.
>>
>> Calculate the maximum number of nanoseconds the watchdog clocksource can
>> represent without overflow and do not disqualify the main clocksource if
>> the delta since the last time we have checked exceeds the measurement
>> capabilities of the watchdog clocksource.
>
> Sorry for not getting back to you sooner on this. You managed to send
> these both out while I was at a conference and on vacation, and so
> they were deep in the mail backlog. :)

No worries, I'm actually "on the road" this week too (ELC). I appreciate
the reply.

> So I'm sympathetic to this issue, because I remember seeing similar
> problems w/ runaway SCHED_FIFO tasks w/ PREEMPT_RT.

Yeah, a runaway rt thread can easily do it. That's just bad design. In
our case it was a bit more subtle because it was a combination of high
priority interrupts and rt threads that would occasionally stack up to
delay the timer softirq long enough to cause the watchdog wrap.

> However, its really difficult to create a solution without opening new
> cases where bad clocksources will be mis-identified as good (which
> your solution seems to suffer as well, measuring the time past with a
> known bad clocksource can easily result in large deltas, which will be
> ignored if the watchdog has a short interval).

Fair point. Ultimately you have to trust one of the clocksources. I
guess I was naive in thinking that the main clocksource can't drift more
than what the watchdog clocksource can measure within the
WATCHDOG_INTERVAL. I'm glad I don't have to deal with hardware that
lobotomized.

Would a simple solution that exposes the config option for the
clocksource watchdog[1] (and defaults it to on) be an acceptable
alternative? It would work for us because we test the stability of the
main clocksource as part of hardware bring-up.

> A previous effort on this was made here, and there's a resulting
> thread that didn't come to resolution:
> https://lkml.org/lkml/2015/8/17/542

Sorry, I had missed that thread.

> Way back I had tried to come up with an approach where if the time
> delta was large, it was divided by the watchdog interval, and then we
> just compared the remainder with the current watchdog delta to see if
> they matched (closely enough). Unfortunately this didn't work out for
> me then, but perhaps it deserves a second try?

I've entertained that idea too, but I think I was trying to optimize
too early by doing everything with the mult/shift math. That first
attempt failed, but I do need to try again because it would be a better
general solution.

> thanks
> -john

Thanks,
-Gratian

[1]

From e942ddaba439cd6711e9eed44ceae34167b864f8 Mon Sep 17 00:00:00 2001
From: Gratian Crisan 
Date: Wed, 6 Apr 2016 21:20:15 -0700
Subject: [PATCH] time: Make the clocksource watchdog user configurable

The clocksource watchdog is used to detect instabilities in the current
clocksource. This is a beneficial feature on new/unknown hardware however
it can create problems by falsely triggering when the watchdog wraps. The
reason is that an interrupt storm and/or high priority (FIFO/RR) tasks can
preempt the timer softirq long enough for the watchdog to wrap if it has a
limited number of bits available by comparison with the main clocksource.

One observed example is on an Intel Baytrail platform where TSC is the main
clocksource, HPET is disabled due to a hardware bug and acpi_pm gets
selected as the watchdog clocksource.

Provide the option to disable the clocksource watchdog for hardware where
the clocksource stability has been validated.

Signed-off-by: Gratian Crisan 
---
 arch/x86/Kconfig|  2 +-
 kernel/time/Kconfig | 12 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2dc18605..6da5d9e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,7 +54,7 @@ config X86
select CLKEVT_I8253
select CLKSRC_I8253 if X86_32
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
-   select CLOCKSOURCE_WATCHDOG
+   select HAVE_CLOCKSOURCE_WATCHDOG
select CLONE_BACKWARDS  if X86_32
select 

[PATCH RFC] clocksource: Detect a watchdog overflow

2016-03-15 Thread Gratian Crisan
The clocksource watchdog can falsely trigger and disable the main
clocksource when the watchdog wraps around.

The reason is that an interrupt storm and/or high priority (FIFO/RR) tasks
can preempt the timer softirq long enough for the watchdog to wrap around
if it has a limited number of bits available by comparison to the main
clocksource. One observed example is on an Intel Baytrail platform where TSC
is the main clocksource, HPET is disabled due to a hardware bug and acpi_pm
gets selected as the watchdog clocksource.

Calculate the maximum number of nanoseconds the watchdog clocksource can
represent without overflow and do not disqualify the main clocksource if
the delta since the last time we have checked exceeds the measurement
capabilities of the watchdog clocksource.

Signed-off-by: Gratian Crisan 
Cc: John Stultz 
Cc: Thomas Gleixner 
---
 kernel/time/clocksource.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..597132e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -170,7 +170,7 @@ static void clocksource_watchdog(unsigned long data)
 {
struct clocksource *cs;
cycle_t csnow, wdnow, cslast, wdlast, delta;
-   int64_t wd_nsec, cs_nsec;
+   int64_t wd_nsec, wd_max_nsec, cs_nsec;
int next_cpu, reset_pending;
 
spin_lock(&watchdog_lock);
@@ -178,6 +178,8 @@ static void clocksource_watchdog(unsigned long data)
goto out;
 
reset_pending = atomic_read(&watchdog_reset_pending);
+   wd_max_nsec = clocksource_cyc2ns(watchdog->max_cycles, watchdog->mult,
+   watchdog->shift);
 
list_for_each_entry(cs, &watchdog_list, wd_list) {
 
@@ -216,8 +218,12 @@ static void clocksource_watchdog(unsigned long data)
if (atomic_read(&watchdog_reset_pending))
continue;
 
-   /* Check the deviation from the watchdog clocksource. */
-   if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
+   /*
+* Check the deviation from the watchdog clocksource,
+* accounting for a possible watchdog overflow.
+*/
+   if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD &&
+   cs_nsec < wd_max_nsec) {
		pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
			smp_processor_id(), cs->name);
		pr_warn("  '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
-- 
2.7.1



[RFC][PATCH] clocksource: Detect a watchdog overflow

2016-03-03 Thread gratian . crisan
From: Gratian Crisan 

The clocksource watchdog can falsely trigger and disable the main
clocksource when the watchdog wraps around.

The reason is that an interrupt storm and/or high priority (FIFO/RR) tasks
can preempt the timer softirq long enough for the watchdog to wrap around
if it has a limited number of bits available by comparison with the main
clocksource. One observed example is on an Intel Baytrail platform where TSC
is the main clocksource, HPET is disabled due to a hardware bug and acpi_pm
gets selected as the watchdog clocksource.

Calculate the maximum number of nanoseconds the watchdog clocksource can
represent without overflow and do not disqualify the main clocksource if
the delta since the last time we have checked exceeds the measurement
capabilities of the watchdog clocksource.

Signed-off-by: Gratian Crisan 
Cc: John Stultz 
Cc: Thomas Gleixner 
---
 kernel/time/clocksource.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..597132e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -170,7 +170,7 @@ static void clocksource_watchdog(unsigned long data)
 {
struct clocksource *cs;
cycle_t csnow, wdnow, cslast, wdlast, delta;
-   int64_t wd_nsec, cs_nsec;
+   int64_t wd_nsec, wd_max_nsec, cs_nsec;
int next_cpu, reset_pending;
 
spin_lock(&watchdog_lock);
@@ -178,6 +178,8 @@ static void clocksource_watchdog(unsigned long data)
goto out;
 
reset_pending = atomic_read(&watchdog_reset_pending);
+   wd_max_nsec = clocksource_cyc2ns(watchdog->max_cycles, watchdog->mult,
+   watchdog->shift);
 
list_for_each_entry(cs, &watchdog_list, wd_list) {
 
@@ -216,8 +218,12 @@ static void clocksource_watchdog(unsigned long data)
if (atomic_read(&watchdog_reset_pending))
continue;
 
-   /* Check the deviation from the watchdog clocksource. */
-   if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
+   /*
+* Check the deviation from the watchdog clocksource,
+* accounting for a possible watchdog overflow.
+*/
+   if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD &&
+   cs_nsec < wd_max_nsec) {
		pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
			smp_processor_id(), cs->name);
		pr_warn("  '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
-- 
2.7.1



Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-19 Thread Gratian Crisan

Peter Zijlstra writes:

> On Wed, Nov 11, 2015 at 09:41:25AM -0600, Gratian Crisan wrote:
>> I also wrote a small C utility[1], with a bit of code borrowed from the
>> kernel, for reading the TSC on all CPUs. It starts a high priority
>> thread per CPU, tries to synchronize them and prints out the TSC values
>> and their offset with regards to CPU0.
>> It can be called from a SysV init shell script[2] at the beginning of
>> the boot process and right before a reboot to save the values in a file.
>
> Could you also read and print TSC_ADJUST (msr 0x3b) ? This would tell us
> if for example your BIOS messed it up.

We've gathered some more information on this:

1. We were able to confirm that the TSC_ADJUST is set for CPU0 before
the bootloader or kernel start. Using an EFI shell to read MSR 0x3b
shows:
 0: 0xAD999CCB
 1: 0x
 2: 0x
 3: 0x
 4: 0x
 5: 0x
 6: 0x
 7: 0x

2. We were also able to reproduce this behavior on a Dell Precision
T5810 workstation PC. Relevant /proc/cpuinfo below[1]. Additionally we
have observed the same behavior on an off the shelf motherboard
equipped with a Haswell-E similar to the Haswell-EP we originally saw
this on.

We are still trying to figure out what in the boot chain sets the CPU0's
TSC_ADJUST i.e. BIOS, something pre-BIOS and how to proceed from there.

Thanks,
Gratian

[1]
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 63
model name  : Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
stepping: 2
microcode   : 0x27
cpu MHz : 1200.117
cache size  : 15360 KB
physical id : 0
siblings: 12
core id : 0
cpu cores   : 6
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 15
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 
ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt 
pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 
avx2 smep bmi2 erms invpcid
bogomips: 6983.91
clflush size: 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-17 Thread Gratian Crisan

Dave Hansen writes:

> On 11/09/2015 02:02 PM, Peter Zijlstra wrote:
>> On Mon, Nov 09, 2015 at 01:59:02PM -0600, gratian.cri...@ni.com wrote:
>>> The Intel Xeon E5 processor family suffers from errata[1] BT81:
>> 
>>> +#ifdef CONFIG_X86_TSC
>>> +   /*
>>> +* Xeon E5 BT81 errata: TSC is not affected by warm reset.
>>> +* The TSC registers for CPUs other than CPU0 are not cleared by a warm
>>> +* reset resulting in a constant offset error.
>>> +*/
>>> +   if ((c->x86 == 6) && (c->x86_model == 0x3f))
>>> +   set_cpu_bug(c, X86_BUG_TSC_OFFSET);
>>> +#endif
>> 
>> That's hardly a family, that's just one, Haswell server.
>
> How did you come up with that x86_model?  The document you linked to
> claimes that "Extended Model" is 0010b and "Model Number" is 1101b, so
> the x86_model you are looking for should be 0x2d.

Apologies, I messed up. The observed behavior seemed to match the
errata and it was a Xeon E5. I used the model number I read off the
machine exhibiting the behavior without properly matching it against
the model number in the errata.

In the meantime Peter Zijlstra pointed me in the right direction i.e. it
looks like the BIOS is changing the TSC_ADJUST for CPU0 but not any of
the other ones. I'll sort it out with our BIOS guys and drop this patch.

Sorry again for the confusion.
-Gratian


Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-17 Thread Gratian Crisan

Peter Zijlstra writes:

> On Wed, Nov 11, 2015 at 09:41:25AM -0600, Gratian Crisan wrote:
>> I also wrote a small C utility[1], with a bit of code borrowed from the
>> kernel, for reading the TSC on all CPUs. It starts a high priority
>> thread per CPU, tries to synchronize them and prints out the TSC values
>> and their offset with regards to CPU0.
>> It can be called from a SysV init shell script[2] at the beginning of
>> the boot process and right before a reboot to save the values in a file.
>
> Could you also read and print TSC_ADJUST (msr 0x3b) ? This would tell us
> if for example your BIOS messed it up.

Good call on the TSC_ADJUST. The BIOS seems to set it for CPU0 but not
for any of the other ones. I'll bug our BIOS guys about it.
Here's how the data looks after a couple reboots:

stop  : 127385698358[0] 741784252175365[741656866477007] 
741784252175432[741656866477074] 741784252175471[741656866477113] 
741784252175349[741656866476991] 741784252175458[741656866477100] 
741784252175285[741656866476927] 741784252175501[741656866477143] 
TSC_ADJUST: -741656866477048 0 0 0 0 0 0 0 

start : 47601069657[0] 741849504816842[741801903747185] 
741849504816884[741801903747227] 741849504817004[741801903747347] 
741849504817113[741801903747456] 741849504817051[741801903747394] 
741849504816746[741801903747089] 741849504816962[741801903747305] 
TSC_ADJUST: -741801903747272 0 0 0 0 0 0 0 

stop  : 127495422447[0] 741929399169793[741801903747346] 
741929399169821[741801903747374] 741929399169739[741801903747292] 
741929399169767[741801903747320] 741929399169657[741801903747210] 
741929399169612[741801903747165] 741929399169679[741801903747232] 
TSC_ADJUST: -741801903747272 0 0 0 0 0 0 0 

start : 47522880051[0] 741994508088208[741946985208157] 
741994508088258[741946985208207] 741994508088305[741946985208254] 
741994508088110[741946985208059] 741994508088052[741946985208001] 
741994508088020[741946985207969] 741994508087930[741946985207879] 
TSC_ADJUST: -741946985208111 0 0 0 0 0 0 0

Thanks a lot for helping with this,
-Gratian


Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-11 Thread Gratian Crisan

Josh Hunt writes:

> On Tue, Nov 10, 2015 at 1:47 PM, Gratian Crisan  wrote:
>>
>> The observed behavior does seem to match BT81 errata i.e. the TSC does
>> not get reset on warm reboots and it is otherwise stable.
>>
> If you have a simple testcase to reproduce the problem I'd be
> interested in seeing it.

We first hit this bug on a 4.1 PREEMPT_RT kernel, where it actually
causes a boot hang on warm reboots. I haven't yet gotten to the bottom
of why a large TSC offset vs. CPU0 would cause the hang. I have some
stack traces around somewhere I can dig up.

I also wrote a small C utility[1], with a bit of code borrowed from the
kernel, for reading the TSC on all CPUs. It starts a high priority
thread per CPU, tries to synchronize them and prints out the TSC values
and their offset with regards to CPU0.
It can be called from a SysV init shell script[2] at the beginning of
the boot process and right before a reboot to save the values in a file.

I've pasted the results after 3 reboots [3]. You can see the CPU0's TSC
getting reset on reboot and the other cores happily ticking on throughout
the reboot.

-Gratian

[1] read-tsc.c
--8<---cut here---start->8---
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sched.h>
#include <pthread.h>
#include <sys/types.h>

#define DECLARE_ARGS(val, low, high)	unsigned low, high
#define EAX_EDX_VAL(val, low, high)	((low) | ((uint64_t)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)	"a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)	"=a" (low), "=d" (high)

static int thread_sync;
static unsigned long long *tsc_data = NULL;

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

static inline void rep_nop(void)
{
asm volatile("rep; nop" ::: "memory");
}

static inline void cpu_relax(void)
{
rep_nop();
}

static inline unsigned long long rdtsc_ordered(void)
{
DECLARE_ARGS(val, low, high);

asm volatile("lfence" : : : "memory");
asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

return EAX_EDX_VAL(val, low, high);
}

static void* threadfn(void *param)
{
long cpu = (long)param;
cpu_set_t mask;
struct sched_param schedp;

CPU_ZERO(&mask);
CPU_SET(cpu, &mask);
if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
perror("error: Failed to set the CPU affinity");
return NULL;
}

/*
 * Set the thread priority just below the migration thread's. The idea
 * is to minimize the chances of being preempted while running the test.
 */
memset(&schedp, 0, sizeof(schedp));
schedp.sched_priority = sched_get_priority_max(SCHED_FIFO) - 1;
if (sched_setscheduler(0, SCHED_FIFO, &schedp) == -1) {
perror("error: Failed to set the thread priority");
return NULL;
}

__sync_sub_and_fetch(&thread_sync, 1);
while (ACCESS_ONCE(thread_sync))
cpu_relax();

tsc_data[cpu] = rdtsc_ordered();

return NULL;
}

int main(int argc, char* argv[])
{
long i;
unsigned long n_cpus;
pthread_t *th = NULL;
int ret = EXIT_SUCCESS;

n_cpus = sysconf(_SC_NPROCESSORS_ONLN);
thread_sync = n_cpus;
__sync_synchronize();

tsc_data = (unsigned long long*)malloc(n_cpus *
   sizeof(unsigned long long));
if (!tsc_data) {
		fprintf(stderr, "error: Failed to allocate memory for TSC data\n");
ret = EXIT_FAILURE;
goto out;
}

th = (pthread_t *)malloc(n_cpus * sizeof(pthread_t));
if (!th) {
		fprintf(stderr, "error: Failed to allocate memory for thread data\n");
ret = EXIT_FAILURE;
goto out;
}

for (i = 0; i < n_cpus; i++)
pthread_create(&th[i], NULL, threadfn, (void*)i);   

for (i = 0; i < n_cpus; i++)
pthread_join(th[i], NULL);

if (argc > 1)
printf("%s: ", argv[1]);
for (i = 0; i < n_cpus; i++)
printf("%llu[%lld] ", tsc_data[i], tsc_data[i] - tsc_data[0]);
printf("\n");

  out:
free(tsc_data);
free(th);

return ret;
}
--8<---cut here---end--->8---


[2] /etc/init.d/save-tsc
--8<---cut here---start->8---
#!/bin/sh

read-tsc "$1" >> /tsc.dat
exit 0
--8<---cut here---end-

Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-10 Thread Gratian Crisan

Josh Hunt writes:

> On Tue, Nov 10, 2015 at 12:24 PM, Josh Hunt  wrote:
>>
>> On Mon, Nov 9, 2015 at 4:02 PM, Peter Zijlstra  wrote:
>>>
>>> On Mon, Nov 09, 2015 at 01:59:02PM -0600, gratian.cri...@ni.com wrote:
>>>
>>> > The Intel Xeon E5 processor family suffers from errata[1] BT81:
>>>
>>> > +#ifdef CONFIG_X86_TSC
>>> > + /*
>>> > +  * Xeon E5 BT81 errata: TSC is not affected by warm reset.
>>> > +  * The TSC registers for CPUs other than CPU0 are not cleared by a 
>>> > warm
>>> > +  * reset resulting in a constant offset error.
>>> > +  */
>>> > + if ((c->x86 == 6) && (c->x86_model == 0x3f))
>>> > + set_cpu_bug(c, X86_BUG_TSC_OFFSET);
>>> > +#endif
>>>
>>> That's hardly a family, that's just one, Haswell server.
>>
>>
>> Are you actually observing this problem on this processor?
>>
>> The document you've referenced and the x86_model # above do not match up. 
>> The errata should be for Intel processors with an x86_model value of 0x2d by 
>> my calculations:
>>
>> Model: 1101b
>> Extended Model: 0010b
>>
>> The calc from cpu_detect() is:
>>  if (c->x86 >= 0x6)
>> c->x86_model += ((tfms >> 16) & 0xf) << 4;
>>
>> 0x3f is a different CPU.

The processor I am seeing the issue on is (according to /proc/cpuinfo):
vendor_id   : GenuineIntel
cpu family  : 6
model   : 63
model name  : Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz
stepping: 2
microcode   : 0x2e

The observed behavior does seem to match BT81 errata i.e. the TSC does
not get reset on warm reboots and it is otherwise stable.

However, you are correct in pointing out that the errata CPU model number
does not match. My apologies; decoding x86 CPU model numbers is a new
area for me, and the title of the errata made it sound like it applies
to the whole Intel Xeon E5 family. It was only when trying to reply to
Peter's comment that I noticed the discrepancy in the model number.

I am currently trying to figure out if there is an errata that
specifically lists Xeon E5-2618L with TSC problems on warm resets and I
will re-work this.

Sorry again for not double checking.

-Gratian


[RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-09 Thread gratian . crisan
From: Gratian Crisan 

The Intel Xeon E5 processor family suffers from errata[1] BT81:
"TSC is Not Affected by Warm Reset.
Problem: The TSC (Time Stamp Counter MSR 10H) should be cleared on reset.
Due to this erratum the TSC is not affected by warm reset.
Implication: The TSC is not cleared by a warm reset. The TSC is cleared by
power-on reset as expected. Intel has not observed any functional failures
due to this erratum.
Workaround: None identified.
Status: There are no plans to fix this erratum."

The observed behavior: after a warm reset the TSC gets reset for CPU0 but
not for any of the other cores, i.e. they continue incrementing from the
value they had before the reset. The TSCs are otherwise stable and
always-running so the offset error stays constant.

Add x86 bug flag if an Intel Xeon E5 gets detected and based on that
synchronize the TSC offset by performing the following measurement:

target source
  t0 ---\
 \-->
 ts
 /--
  t1 <--/

Where:
  * target is the target CPU whose TSC offset we are trying to correct;
  * source is the source CPU used as a reference (i.e. the boot CPU);
  * t0, t1 are TSC time-stamps obtained on the target CPU;
  * ts is the time-stamp acquired on the source CPU.

If the source and target CPU TSCs are synchronized, and the interconnect is
symmetric, then ts falls exactly half-way between t0 and t1. In practice
the measured offset will include the RDTSC instruction latency as well as
the latency introduced by the interconnect. To account for these latencies
we are performing the offset measurement in a loop and use for correction
the minimum measured offset; the idea being that it contains the least
amount of unaccounted for latency. The minimum measured offset is then used
to adjust the TSC register on the target CPU.

Signed-off-by: Gratian Crisan 

[1] 
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf
---
 arch/x86/include/asm/cpufeature.h |   1 +
 arch/x86/kernel/cpu/intel.c   |   9 +++
 arch/x86/kernel/tsc_sync.c| 124 ++
 3 files changed, 134 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e4f8010..3fb0b62 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -272,6 +272,7 @@
 #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
 #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
 #define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
+#define X86_BUG_TSC_OFFSET	X86_BUG(9) /* CPU has skewed but stable TSCs */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 209ac1e..42732dc 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -296,6 +296,15 @@ static void intel_workarounds(struct cpuinfo_x86 *c)
 #else
 static void intel_workarounds(struct cpuinfo_x86 *c)
 {
+#ifdef CONFIG_X86_TSC
+   /*
+* Xeon E5 BT81 errata: TSC is not affected by warm reset.
+* The TSC registers for CPUs other than CPU0 are not cleared by a warm
+* reset resulting in a constant offset error.
+*/
+   if ((c->x86 == 6) && (c->x86_model == 0x3f))
+   set_cpu_bug(c, X86_BUG_TSC_OFFSET);
+#endif
 }
 #endif
 
diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 78083bf..0d0f40c 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -114,6 +114,124 @@ static inline unsigned int loop_timeout(int cpu)
 }
 
 /*
+ * Read the current TSC counter value excluding time-stamps that are zero.
+ * Zero is treated as a special measurement synchronization value in the TSC
+ * offset synchronization code.
+ */
+static inline unsigned long long get_cycles_nz(void)
+{
+   unsigned long long ts;
+again:
+   ts = rdtsc_ordered();
+   if (unlikely(!ts))
+   goto again;
+   return ts;
+}
+
+static atomic64_t target_t0;
+static atomic64_t target_t1;
+static atomic64_t source_ts;
+/*
+ * Measure the TSC offset for the target CPU being brought up vs. the source
+ * CPU. We are collecting three time-stamps:
+ *
+ * target source
+ *   t0 ---\
+ *  \-->
+ *  ts
+ *  /--
+ *   t1 <--/
+ *
+ * If the source and target TSCs are synchronized, and the interconnect is
+ * symmetric, then ts falls exactly half-way between t0 and t1. We are
+ * returning any deviation from the [t0..t1] mid-point as the offset of
+ * the target TSC vs. the source TSC. The measured offset will contain
+ * errors like the latency of the RDTSC instruction and the latency
+ * introduced by the interconnect. Multiple
+ * measurements are required to filter out these errors.
+ */
+static s64 target_tsc_offset(void)
+{
+