On 30/08/2015 21:23, Sagi Grimberg wrote:
> 
> Looks like for some reason cm_get_bth_pkey got pkey_index of 0xffff
> instead of 0 (working on the default pkey 0xffff at entry 0).

It looks like the mlx5 driver misinterprets the completion format. It
takes a field that the programmer's reference manual defines as the pkey
itself and treats it as a pkey_index [1].
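
To illustrate the distinction, here is a minimal sketch of translating a
pkey value into a table index before passing it on. This is not the mlx5
code or a proposed patch: struct pkey_table and pkey_to_index() are
hypothetical names, and the matching rule (compare the low 15 bits,
ignoring the membership bit) is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a port's P_Key table (not a kernel structure). */
struct pkey_table {
    const uint16_t *entries;
    size_t len;
};

/*
 * Return the table index whose entry matches the given pkey value,
 * or -1 if no entry matches. Only the low 15 bits are compared;
 * bit 15 is the membership bit and is ignored in this sketch.
 */
static int pkey_to_index(const struct pkey_table *tbl, uint16_t pkey)
{
    for (size_t i = 0; i < tbl->len; i++) {
        if ((tbl->entries[i] & 0x7fff) == (pkey & 0x7fff))
            return (int)i;
    }
    return -1;
}
```

With the default pkey 0xffff stored at entry 0, a lookup of the value
0xffff yields index 0, whereas using 0xffff directly as an index gives
the out-of-range 65535 seen in the log.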

> log:
> infiniband mlx5_0: ib_cm: Couldn't retrieve pkey for incoming request (port 
> 1, pkey index 65535). -22
> ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c90300ed0960, t_port_id 
> 0x2c90300ed0950:0x2c90300ed0950 and it_iu_len 260 on port 1 
> (guid=0xfe80000000000000:0x2c90300ed0950)
> ib_srpt Session : kernel thread ib_srpt_compl (PID 8584) started
> infiniband mlx5_0: ib_cm: Couldn't retrieve pkey for incoming request (port 
> 1, pkey index 65535). -22
> ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c90300ed0960, t_port_id 
> 0x2c90300ed0950:0x2c90300ed0950 and it_iu_len 260 on port 1 
> (guid=0xfe80000000000000:0x2c90300ed0950)
> ib_srpt Session : kernel thread ib_srpt_compl (PID 8585) started
> mlx5_0:dump_cqe:238:(pid 8584): dump error cqe
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000 00000000
> 0000002b 00000000 00000000 00000000
> 00000000 94003004 0000002c 0000b8e0
> ib_srpt receiving failed for idx 0 with status 4
> 0000:04:00.0:poll_health:151:(pid 0): device's health compromised
> assert_var[0] 0x00000094
> assert_var[1] 0x00000000
> assert_var[2] 0x00000000
> assert_var[3] 0x00000000
> assert_var[4] 0x00000000
> assert_exit_ptr 0x0061d35c
> assert_callra 0x0067a5f4
> fw_ver 0xa0641900
> hw_id 0x000001ff
> irisc_index 2
> synd 0x1: firmware internal error
> ext_sync 0x0000
> 0000:04:00.0:health_care:76:(pid 7943): handling bad device here
> ib_srpt Received DREQ and sent DREP for session 
> 0x00000000000000000002c90300ed0960.
> ib_srpt Received DREQ and sent DREP for session 
> 0x00000000000000000002c90300ed0960.
> ib_srpt Received IB TimeWait exit for cm_id ffff88046d1fb200.
> ib_srpt Received IB TimeWait exit for cm_id ffff880454ffa000.
> ib_srpt Session 0x00000000000000000002c90300ed0960: kernel thread 
> ib_srpt_compl (PID 8585) stopped

I don't know how that could cause the other errors in the log, though
(the error CQE and the firmware assert).

Haggai

[1]
http://lxr.free-electrons.com/source/drivers/infiniband/hw/mlx5/cq.c?v=4.1#L230