Hi,

Giovanni ran into a pretty reproducible situation in which the libmicro
benchmark[1] shows a functional regression in sysv semaphores on upstream
kernels. Specifically, the 'cascade_cond' and 'cascade_flock' programs
exhibit hangs in libc's semop(), blocked waiting for zero. Alternatively,
the following splats may appear:

[  692.991258] BUG: unable to handle kernel NULL pointer dereference (null)
[  692.992062] IP: [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[  692.992062] PGD 862fab067 PUD 858bbc067 PMD 0
[  692.992062] Oops: 0000 [#1] SMP
[  692.992062] Modules linked in: ...
[  692.992062] CPU: 18 PID: 7398 Comm: cascade_flock Tainted: G            E   
4.6.0-juancho2-default+ #18
[  692.992062] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
GRNDSDP1.86B.0030.R03.1405061547 05/06/2014
[  692.992062] task: ffff88084a7e9640 ti: ffff880854748000 task.ti: 
ffff880854748000
[  692.992062] RIP: 0010:[<ffffffff812a0a9f>]  [<ffffffff812a0a9f>] 
unmerge_queues+0x2f/0x70
[  692.992062] RSP: 0018:ffff88085474bce8  EFLAGS: 00010216
[  692.992062] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88086cc3d0d0
[  692.992062] RDX: ffff88086cc3d0d0 RSI: ffff88086cc3d0d0 RDI: ffff88086cc3d040
[  692.992062] RBP: ffff88085474bce8 R08: 0000000000000007 R09: ffff88086cc3d088
[  692.992062] R10: 0000000000000000 R11: 000000a1597ea64c R12: ffff88085474bd90
[  692.992062] R13: ffff88086cc3d040 R14: 0000000000000000 R15: 00000000ffffffff
[  692.992062] FS:  00007faa46a2d700(0000) GS:ffff88086e500000(0000) 
knlGS:0000000000000000
[  692.992062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  692.992062] CR2: 0000000000000000 CR3: 0000000862faa000 CR4: 00000000001406e0
[  692.992062] Stack:
[  692.992062]  ffff88085474bf38 ffffffff812a2ac3 ffffffff810c3995 
ffff88084a7e9640
[  692.992062]  0000000000000000 ffffffff81cb1f48 0000000000000002 
ffff880400038000
[  692.992062]  ffff88084a7e9640 ffff88085474bd40 fffffffffffffffc 
ffff88085474bd40
[  692.992062] Call Trace:
[  692.992062]  [<ffffffff812a2ac3>] SYSC_semtimedop+0x833/0xc00
[  692.992062]  [<ffffffff810c3995>] ? __wake_up_common+0x55/0x90
[  692.992062]  [<ffffffff811f25c0>] ? kmem_cache_alloc+0x1e0/0x200
[  692.992062]  [<ffffffff81266b8b>] ? locks_alloc_lock+0x1b/0x70
[  692.992062]  [<ffffffff81266f33>] ? locks_insert_lock_ctx+0x93/0xa0
[  692.992062]  [<ffffffff81268594>] ? flock_lock_inode+0xf4/0x220
[  692.992062]  [<ffffffff81269cd7>] ? locks_lock_inode_wait+0x47/0x160
[  692.992062]  [<ffffffff811f25c0>] ? kmem_cache_alloc+0x1e0/0x200
[  692.992062]  [<ffffffff81266b8b>] ? locks_alloc_lock+0x1b/0x70
[  692.992062]  [<ffffffff81266d0f>] ? locks_free_lock+0x4f/0x60
[  692.992062]  [<ffffffff812a3340>] SyS_semop+0x10/0x20
[  692.992062]  [<ffffffff81639c32>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[  692.992062] Code: 00 55 8b 47 7c 48 89 e5 85 c0 75 53 48 8b 4f 48 4c 8d 4f 48 4c 39 c9 48 8b 11 48 89 ce 75 08 eb 36 48 89 d1 48 89 c2 48 8b 41 28 <0f> b7 00 48 c1 e0 06 48 03 47 40 4c 8b 40 18 48 89 70 18 48 83
[  692.992062] RIP  [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[  692.992062]  RSP <ffff88085474bce8>
[  692.992062] CR2: 0000000000000000
[  693.882179] ---[ end trace 5605f108ab79cdb2 ]---

Or,

[  463.567641] BUG: unable to handle kernel paging request at fffffffffffffffa
[  463.576246] IP: [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[  463.584553] PGD 1c0d067 PUD 1c0f067 PMD 0
[  463.590071] Oops: 0000 [#1] SMP
[  463.594667] Modules linked in: ...
[  463.664710] Supported: Yes
[  463.668682] CPU: 6 PID: 2912 Comm: cascade_cond Not tainted 4.4.3-29-default 
#1
[  463.677230] Hardware name: SGI.COM C2112-4GP3/X10DRT-P, BIOS 1.0b 04/07/2015
[  463.685588] task: ffff88105dba0b40 ti: ffff8808fc7e0000 task.ti: 
ffff8808fc7e0000
[  463.694366] RIP: 0010:[<ffffffff8126dcbf>]  [<ffffffff8126dcbf>] 
perform_atomic_semop.isra.5+0xcf/0x170
[  463.705084] RSP: 0018:ffff8808fc7e3c60  EFLAGS: 00010217
[  463.711610] RAX: 0000000000000000 RBX: ffff88085d22dae0 RCX: 000000005732f1e7
[  463.719952] RDX: fffffffffffffffa RSI: ffff88085d22dad0 RDI: ffff88085d22da80
[  463.728270] RBP: 0000000000000000 R08: 00000000fffffff7 R09: 0000000000000000
[  463.736561] R10: 0000000000000000 R11: 0000000000000206 R12: ffff88085d22da88
[  463.745001] R13: ffff88085d22dad0 R14: ffffffffffffffc0 R15: ffff88085d22da40
[  463.753450] FS:  00007f30fd9e5700(0000) GS:ffff88085fac0000(0000) 
knlGS:0000000000000000
[  463.762684] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  463.769731] CR2: fffffffffffffffa CR3: 000000017bc09000 CR4: 00000000001406e0
[  463.778039] Stack:
[  463.781130]  ffff8808fc7e3d50 ffff88085d22dad0 ffffffff8126dfe1 
ffffffff4d226800
[  463.789704]  ffff88085d22da80 0000000000000001 ffff88085d22da88 
0000000000000001
[  463.798254]  0000000000000001 ffff88085d22da40 ffff8808fc7e3d50 
ffff8808fc7e3da0
[  463.806758] Call Trace:
[  463.810305]  [<ffffffff8126dfe1>] update_queue+0xa1/0x180
[  463.816706]  [<ffffffff8126eb95>] do_smart_update+0x45/0xf0
[  463.823276]  [<ffffffff8126f141>] SYSC_semtimedop+0x3d1/0xb00
[  463.830035]  [<ffffffff815cf66e>] entry_SYSCALL_64_fastpath+0x12/0x71
[  463.838608] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x71
[  463.846331]
[  463.848747] Leftover inexact backtrace:
[  463.848747]
[  463.855853] Code: 80 00 00 81 f9 ff ff 00 00 0f 87 98 00 00 00 66 45 89 0b 48 83 c2 06 44 89 00 48 39 ea 72 99 48 83 ea 06 8b 4e 20 49 39 d2 77 16 <0f> b7 02 48 83 ea 06 48 c1 e0 06 48 03 07 49 39 d2 89 48 04 76
[  463.877668] RIP  [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[  463.885725]  RSP <ffff8808fc7e3c60>
[  463.890145] CR2: fffffffffffffffa
[  463.894338] ---[ end trace 0b29cae12f0e401c ]---

From both splats I've reached the same conclusion: the array of pending
operations (sop, struct sembuf) is getting corrupted. Taking the second
splat, inside perform_atomic_semop():

 2b:*  0f b7 02           movzwl (%rdx),%eax      <-- trapping instruction

             sma->sem_base[sop->sem_num].sempid = pid;

 2e:   48 83 ea 06        sub    $0x6,%rdx
 32:   48 c1 e0 06        shl    $0x6,%rax
 36:   48 03 07           add    (%rdi),%rax
 39:   49 39 d2           cmp    %rdx,%r10
 3c:   89 48 04           mov    %ecx,0x4(%rax)
 3f:   76                 .byte 0x76
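
For reference, struct sembuf is 6 bytes, matching the 'sub $0x6,%rdx'
stride above as the sops array is walked; paraphrasing (from memory, not
verbatim) the pid-commit walk at the end of perform_atomic_semop(), the
faulting access is the sop->sem_num load:

/* <linux/sem.h>: one operation is 6 bytes, hence the 0x6 stride */
struct sembuf {
	unsigned short	sem_num;	/* semaphore index in the set */
	short		sem_op;		/* <0 down, 0 wait-for-zero, >0 up */
	short		sem_flg;	/* IPC_NOWAIT, SEM_UNDO */
};

	/* walk the sops array backwards, recording the caller's pid;
	 * with a corrupted sops/sop this dereference goes into the weeds */
	sop--;
	pid = q->pid;
	while (sop >= sops) {
		sma->sem_base[sop->sem_num].sempid = pid;
		sop--;			/* i.e. the sub $0x6,%rdx above */
	}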

Via libc's semop(), the benchmark distributes simple and complex ops
fairly evenly, all acting on a single set, i.e.:

      semop(semid: 884736, tsops: 0x7fffd1567bc0, nsops: 1);
      semop(semid: 884736, tsops: 0x7fffd1567bc0, nsops: 2);
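
To make the simple vs. complex distinction concrete, a minimal userspace
illustration (my own toy, not the libmicro code) mixing both kinds of ops
on one set would be:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
	int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
	struct sembuf post = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
	struct sembuf multi[2] = {
		{ .sem_num = 1, .sem_op =  0, .sem_flg = 0 },	/* wait-for-zero */
		{ .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
	};

	if (semid < 0) {
		perror("semget");
		return 1;
	}

	/* nsops == 1: sem_lock() can use the fine-grained per-semaphore lock */
	if (semop(semid, &post, 1) < 0)
		perror("semop simple");

	/* nsops == 2: a complex op, sem_lock() takes the whole-set lock */
	if (semop(semid, multi, 2) < 0)
		perror("semop complex");

	semctl(semid, 0, IPC_RMID);
	return 0;
}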

Given that the corruption suggests a broken sem_lock(), I gave -next's
c4b7bb08c295 ("ipc/sem.c: Fix complex_count vs. simple op race") a try,
but unfortunately it does not fix the issue. In fact, the regression
bisects to the introduction of qspinlocks (as of v4.2), with:

1bf7067 Merge branch 'locking-core-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Considering how sem_lock() plays games with spin_is_locked() and
spin_unlock_wait() to enable finer-grained locking (as opposed to locking
the whole set), the regression seems to be introduced by the fact that
spin_unlock_wait() with qspinlocks only checks the least-significant byte
of the lock word, therefore ignoring pending waiters. Fair enough, given
that a simple write to that byte is enough to release the lock. However,
this is semantically different from what was previously done with ticket
locks, in that spin_unlock_wait() would always observe all waiters by
adding itself to the tail. For sysvsems this can cause sem_wait_array() to
miss pending waiters on a sem->lock while a thread is taking the global
lock: the global-lock holder iterates over (and past) that specific lock
in the semaphore set, and shortly thereafter the pending waiter takes the
already-iterated semaphore.
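
Roughly, the interleaving I have in mind (paraphrasing the
sem_lock()/sem_wait_array() logic, not verbatim code, and assuming a third
task briefly holds the per-semaphore lock so that the qspinlock pending
bit comes into play):

	CPU0 (complex op, nsops > 1)      CPU1 (simple op, nsops == 1)
	----------------------------      ----------------------------
	                                  spin_lock(&sem->lock) while a
	                                  third task holds it: qspinlock
	                                  sets the pending bit and spins,
	                                  the locked byte stays set
	spin_lock(&sma->sem_perm.lock);
	sem_wait_array(sma):
	  spin_unlock_wait(&sem->lock);
	                                  the holder releases, clearing
	                                  the locked byte; pending is
	                                  still set
	  sees locked byte == 0 and
	  returns, missing CPU1
	                                  takes sem->lock and runs its
	                                  simple op on a semaphore that
	                                  CPU0 believes is quiesced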

As such, the following restores the ticket-lock behavior and 'fixes'
(or hides?) the bug in sems. Naturally, this is not a correct approach:

@@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)

        for (i = 0; i < sma->sem_nsems; i++) {
                sem = sma->sem_base + i;
-               spin_unlock_wait(&sem->lock);
+               while (atomic_read(&sem->lock))
+                       cpu_relax();
        }
        ipc_smp_acquire__after_spin_is_unlocked();
}
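
As an aside, sem->lock is a spinlock_t, so the hunk above is shorthand;
spelled out against the raw qspinlock word (assuming the usual
spinlock_t -> raw_spinlock_t -> arch qspinlock nesting and no lock
debugging), the spin would read something like:

	/* wait for the whole lock word -- owner, pending and tail -- to
	 * drain, mimicking the old ticket-lock unlock_wait semantics */
	while (atomic_read(&sem->lock.rlock.raw_lock.val))
		cpu_relax();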

While the differences between the two implementations wrt unlock_wait()
are certainly there, ultimately I agree that code should not rely on the
semantics of spinlock waiters -- and therefore sems need fixing. Note of
course that lockref is the exception to this, given how
queued_spin_value_unlocked() is implemented. In addition, this makes me
wonder whether queued_spin_is_locked() should then be:

-       return atomic_read(&lock->val);
+       return atomic_read(&lock->val) & _Q_LOCKED_MASK;

That is, avoid considering pending waiters as locked.
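
For context, the qspinlock word layout (roughly, from
asm-generic/qspinlock_types.h) is sketched below; the masked check would
then report a lock as held only when there is an actual owner:

/*
 * Rough layout of the qspinlock word (NR_CPUS < 16k case):
 *
 *   bits  0- 7: locked byte (_Q_LOCKED_MASK)
 *   bits  8-15: pending
 *   bits 16-31: MCS tail (idx + cpu)
 */
static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
	/* only an actual owner counts; queued/pending waiters do not */
	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
}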

Thoughts?

Thanks,
Davidlohr

[1] https://hg.java.net/hg/libmicro~hg-repo
