Vladislav Bolkhovitin wrote:
** Sometimes the benchmark "zombied" (the process does no work but can't be killed) after running for a certain amount of time. However, it wasn't reliably repeatable, so I only mark that this particular run has zombied before.

That means that there is a bug somewhere. Usually such bugs are found in a few hours of code auditing (the srpt driver is pretty simple) or by using kernel debug facilities (example diff to .config attached). I personally always prefer to put my effort into fixing real things, not inventing various workarounds, like srpt_thread in this case.
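The attachment itself is not shown here. As a rough sketch of what "kernel debug facilities" usually means for a locking problem on a 2.6.18-era kernel, options along the following lines are commonly enabled. The selection is an assumption for illustration, not the contents of the attached diff.

```
# Assumed selection of lock-debugging options for a 2.6.18-era kernel;
# not the diff attached to the original message.
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_PROVE_LOCKING=y
CONFIG_DETECT_SOFTLOCKUP=y
```

With options like these, a sleeping call or an inconsistent lock order in atomic context tends to produce a warning at the point of the mistake rather than a lockup later.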

So I would:

1. Completely remove the srpt thread and all related code. It doesn't do anything that can't be done in SIRQ context (tasklet).

2. Audit the code to check whether it does anything it shouldn't do in SIRQ context, and fix it. This step isn't required, but it usually saves a lot of time of puzzled debugging in the future.

3. In srpt_handle_rdma_comp() and srpt_handle_new_iu(), change SCST_CONTEXT_THREAD to SCST_CONTEXT_DIRECT_ATOMIC.

I also changed it in srpt_handle_err_comp().

Then I would run the problematic tests (e.g. the heavy TPC-H workload) on a debug kernel and fix the problems found.

Anyway, Cameron, can you get the latest code from SCST trunk and try with it? It was recently updated. Also please add the case with changes from (3) above.
This is all with version 1.0.1 of SCST (v532).
In my fio test, I do runs with srpt thread=1 and then =0. When it was set to zero during the test, I got many errors printed out by FIO, and the target eventually crashed. This is the first part of a long call trace.

NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: ib_srpt(U) scst_vdisk(U) scst(U) fio_driver(PU) fio_port(PU) autofs4 hidp rfcomm l2cap bluetooth sunrpc ib_ipoib mlx4_ib ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api nls_utf8 hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 shpchp i2c_core e1000e mlx4_core i5000_edac edac_mc pcspkr ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 25732, comm: scsi_tgt0 Tainted: P      2.6.18-92.1.13.el5 #1
RIP: 0010:[<ffffffff80064bcb>] [<ffffffff80064bcb>] .text.lock.spinlock+0x29/0x30
RSP: 0018:ffffffff80418a88  EFLAGS: 00000086
RAX: ffff810785307fd8 RBX: ffffffff884e68a0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff884e68a0
RBP: ffffffff884e62a0 R08: ffff810790926900 R09: ffff8107909268e8
R10: 0000000000000018 R11: ffffffff884fcab3 R12: 0000000000000001
R13: 0000000000000001 R14: 0000000000000000 R15: ffff8107f0f374c0
FS:  0000000000000000(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000037bc0986d0 CR3: 0000000000201000 CR4: 00000000000006e0
Process scsi_tgt0 (pid: 25732, threadinfo ffff810785306000, task ffff810810852100)
Stack:  0000000000000000 ffffffff884c509d ffff8107909268e8 ffff810790926900
00000002071dd688 0000020000000220 0000000000000200 00000000da984c08
0000000000000000 ffff8107909267f0 ffff810806ceee20 0000000000000001
Call Trace:
<IRQ>  [<ffffffff884c509d>] :scst:sgv_pool_alloc+0x10c/0x5d3
[<ffffffff884c1f85>] :scst:scst_alloc_space+0x5b/0x106
[<ffffffff884bdc90>] :scst:scst_process_active_cmd+0x4fc/0x131c
[<ffffffff884bee46>] :scst:scst_cmd_init_done+0x17f/0x3ef
[<ffffffff884fb1ff>] :ib_srpt:srpt_handle_new_iu+0x281/0x4e7
[<ffffffff8835ec3d>] :mlx4_ib:mlx4_ib_free_srq_wqe+0x27/0x4f
[<ffffffff883591da>] :mlx4_ib:get_sw_cqe+0x12/0x30
[<ffffffff88359c97>] :mlx4_ib:mlx4_ib_poll_cq+0x432/0x48f
[<ffffffff884fcc43>] :ib_srpt:srpt_completion+0x190/0x250
[<ffffffff8811aa5b>] :mlx4_core:mlx4_eq_int+0x3b/0x26f
[<ffffffff8811ac9e>] :mlx4_core:mlx4_msi_x_interrupt+0xf/0x17
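The trace above lists the innermost frame first: the CPU is spinning on a lock (RIP is in .text.lock.spinlock) under sgv_pool_alloc, and the `<IRQ>` marker shows that the whole chain runs in interrupt context. A small script can make the module boundaries easier to see. The following is a minimal sketch, assuming the 2.6.18-style `[<addr>] :module:symbol+0xoff/0xsize` frame format seen here; the names `FRAME_RE`, `parse_frames`, and `modules_in_order` are hypothetical helpers, and `TRACE` is an excerpt of the frames above, not the full trace.

```python
import re

# One frame of a 2.6.18-style x86_64 trace:
#   [<addr>] :module:symbol+0xoffset/0xsize
# The ":module:" part is absent for symbols built into the kernel image.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(trace):
    """Return (module, symbol) pairs in the order they appear (innermost first)."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("module") or "kernel", match.group("symbol")))
    return frames


def modules_in_order(frames):
    """Return each module once, in order of first appearance."""
    seen = []
    for module, _symbol in frames:
        if module not in seen:
            seen.append(module)
    return seen


# Excerpt of the call trace from this message (some frames omitted).
TRACE = """\
<IRQ>  [<ffffffff884c509d>] :scst:sgv_pool_alloc+0x10c/0x5d3
[<ffffffff884c1f85>] :scst:scst_alloc_space+0x5b/0x106
[<ffffffff884bdc90>] :scst:scst_process_active_cmd+0x4fc/0x131c
[<ffffffff884bee46>] :scst:scst_cmd_init_done+0x17f/0x3ef
[<ffffffff884fb1ff>] :ib_srpt:srpt_handle_new_iu+0x281/0x4e7
[<ffffffff884fcc43>] :ib_srpt:srpt_completion+0x190/0x250
[<ffffffff8811ac9e>] :mlx4_core:mlx4_msi_x_interrupt+0xf/0x17
"""

frames = parse_frames(TRACE)
for module, symbol in frames:
    print(f"{module:10s} {symbol}")
print("modules, innermost first:", modules_in_order(frames))
```

Read bottom-up, the excerpt shows the mlx4_core interrupt handler reaching srpt_completion, which calls srpt_handle_new_iu and then enters scst through scst_cmd_init_done.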

