Hi Bart,

I'll try to explain, as Sebastian is on vacation.

On 23.08.2012 16:43, Bart Van Assche wrote:
> On 08/23/12 15:59, Sebastian Riemer wrote:
> > we've triggered the WARN_ON() in srp_wait_last_send_wqe() by connecting
> > to a disabled SCST SRP target.
> > 
> > I would remove that one.
> > 
> > [ ... ]
> >
> >> +  while (!target->last_send_wqe && time_before(jiffies, deadline)) {
> >> +          srp_send_completion(target->send_cq, target);
> >> +          msleep(20);
> >> +  }
> >> +
> >> +  WARN_ON(!target->last_send_wqe);
> > 
> > <-- here it is - remove it
> 
> But why was that WARN_ON() statement hit ? srp_wait_last_send_wqe() is
> invoked after the QP has been transitioned into the error state. It is
> the responsibility of the HCA to generate an error completion for any
> work queued on a QP that is in the error state. If that WARN_ON()
> statement has been hit that means that it took more than the RC timeout
> before the HCA finished processing earlier queued work and generated an
> error completion. That's not really something I had expected.

This usually occurs when multiple targets are released at the same time.
A typical case is unloading the ib_srp.ko kernel module immediately,
which tears down every InfiniBand connection at once.
It does not happen every time, though, which makes it hard for us to
reproduce.
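
To make the timeout case concrete, here is a minimal userspace sketch of
the wait loop quoted above. It is not the ib_srp code: struct fake_target,
fake_poll(), fake_sleep() and wait_last_wqe() are hypothetical names, the
clock is simulated, and the 20 ms step is only borrowed from the msleep(20)
in the patch. It shows only that the loop gives up once the deadline
passes without the last completion having been seen.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the SRP target state; not the kernel struct. */
struct fake_target {
	bool last_send_wqe;	/* set once the final completion was seen */
	int polls_until_done;	/* simulated HCA latency, in poll rounds */
	long now_ms;		/* simulated clock, replaces jiffies */
};

/* Simulated completion poll: reports the last WQE after N calls. */
static void fake_poll(struct fake_target *t)
{
	if (t->polls_until_done > 0 && --t->polls_until_done == 0)
		t->last_send_wqe = true;
}

/* Simulated sleep: only advances the fake clock. */
static void fake_sleep(struct fake_target *t, long ms)
{
	t->now_ms += ms;
}

/*
 * Poll until the last completion shows up or the deadline passes.
 * Returns true if the completion was seen, false on timeout.
 */
static bool wait_last_wqe(struct fake_target *t, long timeout_ms)
{
	long deadline = t->now_ms + timeout_ms;

	while (!t->last_send_wqe && t->now_ms < deadline) {
		fake_poll(t);
		fake_sleep(t, 20);
	}
	return t->last_send_wqe;
}
```

A false return value in this sketch corresponds to the situation in which
the WARN_ON() in the patch would fire.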

Example kernel trace:

WARNING: at drivers/infiniband/ulp/srp/ib_srp.c:529
srp_disconnect_target+0x317/0x320 [ib_srp]()
Hardware name: H8DGU
Modules linked in:
rdma_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_uverbs ib_umad ib_srp
scsi_transport_srp scsi_tgt ib_cm ib_sa loop ib_mthca psmouse ib_mad
amd64_edac_mod edac_core i2c_piix4 evdev serio_raw edac_mce_amd
ib_core tpm_tis tpm tpm_bios processor button thermal_sys sg
hid_cherry sd_mod crc_t10dif usb_storage ahci libahci libata scsi_mod
[last unloaded: scsi_wait_scan]
Pid: 101, comm: kworker/1:1 Tainted: G W 3.2.8-pserver #1
Call Trace:
[<ffffffff81048dbb>] ? warn_slowpath_common+0x7b/0xc0
[<ffffffffa00a9b07>] ? srp_disconnect_target+0x317/0x320 [ib_srp]
[<ffffffff8106a640>] ? wake_up_bit+0x40/0x40
[<ffffffffa00ab0bf>] ? srp_remove_work+0x13f/0x1c0 [ib_srp]
[<ffffffffa00aaf80>] ? srp_free_req_data+0xd0/0xd0 [ib_srp]
[<ffffffff81063383>] ? process_one_work+0x113/0x470
[<ffffffff81065c73>] ? worker_thread+0x163/0x3e0
[<ffffffff81065b10>] ? manage_workers+0x200/0x200
[<ffffffff81065b10>] ? manage_workers+0x200/0x200
[<ffffffff8106a126>] ? kthread+0x96/0xa0
[<ffffffff8165f674>] ? kernel_thread_helper+0x4/0x10
[<ffffffff8106a090>] ? kthread_worker_fn+0x180/0x180
[<ffffffff8165f670>] ? gs_change+0x13/0x13

Regards,
Dongsu
