On Tue, 20 Feb 2018, Tariq Toukan wrote:

> Hi Thomas,
> 
> We started seeing new issues in our net-device daily regression tests.
> They are related to patch [1] introduced in kernel 4.15-rc1.
> 
> We frequently see a warning in dmesg [2]. Repro is not consistent, we tried to
> narrow it down to a smaller run but couldn't.
> 
> In addition, sometimes (less frequent) the warning is followed by a panic [3].

Which might be just a consequence of the issue which triggers the warning.

> I can share all needed details to help analyze this bug.
> If you suspect specific flows, we can do an educated narrow down.

> [2]
> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 
> irq_matrix_free+0x30/0xd0
> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
> [ 8664.912624] RSP: 0018:ffff88023fd43f70 EFLAGS: 00010002
> [ 8664.915149] RAX: 0000000000026318 RBX: ffff880157a77ec0 RCX: 
> 0000000000000000
> [ 8664.917679] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 
> ffff880237038400
> [ 8664.920244] RBP: ffff880237038400 R08: 00000000e8ba3c69 R09: 
> 0000000000000000
> [ 8664.922813] R10: 00000000000003ff R11: 0000000000000ad9 R12: 
> ffff88023fc40000
> [ 8664.925345] R13: 0000000000000000 R14: 0000000000000001 R15: 
> 000000000000002b
> [ 8664.927872] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) 
> knlGS:0000000000000000
> [ 8664.930455] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8664.932996] CR2: 0000000000f2c030 CR3: 000000000220a000 CR4: 
> 00000000000006e0
> [ 8664.935557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [ 8664.938051] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [ 8664.940541] Call Trace:
> [ 8664.942980]  <IRQ>
> [ 8664.945399]  free_moved_vector+0x4e/0x100
> [ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8664.952480]  </IRQ>

So thats:

        if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
                return;

and the disassembly of

> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60
> 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f>
> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74

gives:

  1f:   44 39 6f 04             cmp    %r13d,0x4(%rdi)
  23:   77 06                   ja     0x2b
  25:   44 3b 6f 08             cmp    0x8(%rdi),%r13d
  29:   72 0b                   jb     0x36
  2b:*  0f ff                   (bad)           <-- trapping instruction

So the vector to release should be in r13d, which is 0. /me scratches head.

> [ 8943.038767] BUG: unable to handle kernel paging request at 000000037a6b561b
> [ 8943.040114] IP: free_moved_vector+0x61/0x100
> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
> [ 8943.055940] RSP: 0018:ffff88023fd43fa0 EFLAGS: 00010007
> [ 8943.057233] RAX: 000000037a6b561b RBX: ffff880157a77ec0 RCX: 
> 0000000000000001
> [ 8943.058506] RDX: 00000000000155a8 RSI: 00000000000155a8 RDI: 
> ffff880237038400
> [ 8943.059784] RBP: ffff880157a77ec0 R08: 00000000e8ba3c69 R09: 
> 0000000000000000
> [ 8943.061051] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 000000007f0c0001
> [ 8943.062462] R13: 00000000000155a8 R14: 0000000000000001 R15: 
> 0000000000cc620d
> [ 8943.063726] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) 
> knlGS:0000000000000000
> [ 8943.064993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8943.066253] CR2: 000000037a6b561b CR3: 000000010badc000 CR4: 
> 00000000000006e0
> [ 8943.067522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [ 8943.068771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [ 8943.070029] Call Trace:
> [ 8943.071273]  <IRQ>
> [ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8943.075048]  </IRQ>
...
> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee
> e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48>
> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48

 17:*   48 c7 00 00 00 00 00    movq   $0x0,(%rax)              <-- trapping 
instruction

I can't tell exactly which part of free_moved_vector() that is and which
vector it tries to free, but its more or less the same crap just with a
fatal vector number.

Is there CPU hotplugging in play?

I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.

Thanks,

        tglx

Reply via email to