Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-23 Thread Alex Rosenbaum

On 5/21/2013 6:24 PM, Hefty, Sean wrote:
> My first guess is that the server isn't responding to new requests.
> - Sean


This is where we're looking now.
We are now testing on 17 servers with 8 clients per server.

When we disable all RDMA traffic in the test, 100% of the RDMA
connections are established. So at least we know this is not some
fundamental issue with our setup.


After modifying our code to give RDMA connection handling a higher
priority than the RDMA traffic (CQ completion handling), we still see
many UNREACHABLE events, but only after quite a few clients have
connected and started pushing traffic (1GB RDMA WRITEs from server to
client).


We are now adding code (via the conn_attr private data) to compare
timestamps between rdma_connect, RDMA_CM_EVENT_CONNECT_REQUEST,
rdma_accept and, on the client, the UNREACHABLE or CONNECTED events.
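
A minimal sketch of how such a timestamp comparison could look, assuming
the three timestamps travel in a small struct inside the private data.
The struct conn_ts layout and all helper names are hypothetical and not
taken from our code. Client and server clocks are not synchronized, so
the sketch only ever subtracts timestamps taken on the same host.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical private-data payload; the layout is an assumption. */
struct conn_ts {
    uint64_t client_connect_ns; /* client clock, taken just before rdma_connect() */
    uint64_t server_req_ns;     /* server clock, taken when the connect request event is seen */
    uint64_t server_accept_ns;  /* server clock, taken just before rdma_accept() */
};

/* Monotonic time in nanoseconds; only comparable on the same host. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time the server spent between seeing the request and accepting it. */
static uint64_t server_handling_ns(const struct conn_ts *t)
{
    assert(t->server_accept_ns >= t->server_req_ns);
    return t->server_accept_ns - t->server_req_ns;
}

/* Client view: from rdma_connect() until the CONNECTED/UNREACHABLE event. */
static uint64_t client_round_trip_ns(const struct conn_ts *t, uint64_t event_ns)
{
    assert(event_ns >= t->client_connect_ns);
    return event_ns - t->client_connect_ns;
}

/* Round trip minus server handling: time spent on the wire and in CM queues. */
static uint64_t outside_server_ns(const struct conn_ts *t, uint64_t event_ns)
{
    return client_round_trip_ns(t, event_ns) - server_handling_ns(t);
}
```

The idea would be that the server fills in its two fields and echoes the
struct back in the accept private data, and the client calls now_ns()
when the event arrives. A large outside_server_ns() with a small
server_handling_ns() would point away from the server's accept path.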

We'll have a better understanding once we see these results.

thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Warning about possible recursive locking detected in IPoIB

2013-05-23 Thread Jack Wang
Hi Or,

I saw the warning below after enabling CONFIG_DEBUG_MUTEXES:


> May 21 08:56:32 ib2 kernel: [   44.738725] =
> May 21 08:56:32 ib2 kernel: [   44.738782] [ INFO: possible recursive locking detected ]
> May 21 08:56:32 ib2 kernel: [   44.738841] 3.9.0-rc7-pserver #4 Tainted: G   O
> May 21 08:56:32 ib2 kernel: [   44.738896] -
> May 21 08:56:32 ib2 kernel: [   44.738953] kworker/u:5/238 is trying to acquire lock:
> May 21 08:56:32 ib2 kernel: [   44.739008]  (&priv->vlan_mutex){+.+.+.}, at: [] __ipoib_ib_dev_flush+0x3c/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.739218]
> May 21 08:56:32 ib2 kernel: [   44.739218] but task is already holding lock:
> May 21 08:56:32 ib2 kernel: [   44.739328]  (&priv->vlan_mutex){+.+.+.}, at: [] __ipoib_ib_dev_flush+0x3c/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.739537]
> May 21 08:56:32 ib2 kernel: [   44.739537] other info that might help us debug this:
> May 21 08:56:32 ib2 kernel: [   44.739613]  Possible unsafe locking scenario:
> May 21 08:56:32 ib2 kernel: [   44.739613]
> May 21 08:56:32 ib2 kernel: [   44.739688]CPU0
> May 21 08:56:32 ib2 kernel: [   44.739741]
> May 21 08:56:32 ib2 kernel: [   44.739791]   lock(&priv->vlan_mutex);
> May 21 08:56:32 ib2 kernel: [   44.739902]   lock(&priv->vlan_mutex);
> May 21 08:56:32 ib2 kernel: [   44.740014]
> May 21 08:56:32 ib2 kernel: [   44.740014]  *** DEADLOCK ***
> May 21 08:56:32 ib2 kernel: [   44.740014]
> May 21 08:56:32 ib2 kernel: [   44.740103]  May be due to missing lock nesting notation
> May 21 08:56:32 ib2 kernel: [   44.740103]
> May 21 08:56:32 ib2 kernel: [   44.740213] 3 locks held by kworker/u:5/238:
> May 21 08:56:32 ib2 kernel: [   44.740266]  #0:  (ipoib){.+.+.+}, at: [] process_one_work+0x165/0x560
> May 21 08:56:32 ib2 kernel: [   44.740495]  #1:  ((&priv->flush_heavy)){+.+...}, at: [] process_one_work+0x165/0x560
> May 21 08:56:32 ib2 kernel: [   44.740725]  #2:  (&priv->vlan_mutex){+.+.+.}, at: [] __ipoib_ib_dev_flush+0x3c/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.740961]
> May 21 08:56:32 ib2 kernel: [   44.740961] stack backtrace:
> May 21 08:56:32 ib2 kernel: [   44.741035] Pid: 238, comm: kworker/u:5 Tainted: G   O 3.9.0-rc7-pserver #4
> May 21 08:56:32 ib2 kernel: [   44.74] Call Trace:
> May 21 08:56:32 ib2 kernel: [   44.741170]  [] ? vprintk_emit+0x280/0x520
> May 21 08:56:32 ib2 kernel: [   44.741233]  [] __lock_acquire+0x6c3/0x17c0
> May 21 08:56:32 ib2 kernel: [   44.741295]  [] ? __lock_acquire+0xc8c/0x17c0
> May 21 08:56:32 ib2 kernel: [   44.741357]  [] ? dump_trace+0x177/0x2f0
> May 21 08:56:32 ib2 kernel: [   44.741418]  [] lock_acquire+0xa2/0x180
> May 21 08:56:32 ib2 kernel: [   44.741483]  [] ? __ipoib_ib_dev_flush+0x3c/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.741563]  [] ? save_stack_trace+0x2f/0x50
> May 21 08:56:32 ib2 kernel: [   44.741626]  [] __mutex_lock_common+0x5b/0x3e0
> May 21 08:56:32 ib2 kernel: [   44.741693]  [] ? __ipoib_ib_dev_flush+0x3c/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.741777]  [] ? __ipoib_ib_dev_flush+0x3c/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.741855]  [] mutex_lock_nested+0x45/0x50
> May 21 08:56:32 ib2 kernel: [   44.741922]  [] __ipoib_ib_dev_flush+0x3c/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.741991]  [] __ipoib_ib_dev_flush+0x5a/0x230 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.742060]  [] ipoib_ib_dev_flush_heavy+0x1a/0x20 [ib_ipoib]
> May 21 08:56:32 ib2 kernel: [   44.742138]  [] process_one_work+0x1d6/0x560
> May 21 08:56:32 ib2 kernel: [   44.742199]  [] ? process_one_work+0x165/0x560
> May 21 08:56:32 ib2 kernel: [   44.742262]  [] worker_thread+0x119/0x370
> May 21 08:56:32 ib2 kernel: [   44.742324]  [] ? manage_workers+0x340/0x340
> May 21 08:56:32 ib2 kernel: [   44.742388]  [] kthread+0xe6/0xf0
> May 21 08:56:32 ib2 kernel: [   44.742450]  [] ? __init_kthread_worker+0x70/0x70
> May 21 08:56:32 ib2 kernel: [   44.742513]  [] ret_from_fork+0x7c/0xb0
> May 21 08:56:32 ib2 kernel: [   44.742575]  [] ? __init_kthread_worker+0x70/0x70
> May 21 08:56:32 ib2 kernel: [   44.744467] IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
> May 21 08:56:45 ib2 kernel: [   57.700823] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
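
One possible reading of the trace, sketched in user space: the
backtrace shows __ipoib_ib_dev_flush calling itself, and the report
mentions "missing lock nesting notation". That would be consistent with
a parent interface holding its own vlan_mutex while recursing into a
child that takes its own vlan_mutex, i.e. two different mutexes of the
same lock class. This is an assumption, not something confirmed here.
The struct fake_priv and all function names below are hypothetical
stand-ins using pthreads, not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for an interface with child interfaces. */
struct fake_priv {
    pthread_mutex_t vlan_mutex;            /* one mutex instance per interface */
    struct fake_priv *child[MAX_CHILDREN];
    size_t nchild;
    int flushed;
};

static void fake_priv_init(struct fake_priv *p)
{
    pthread_mutex_init(&p->vlan_mutex, NULL);
    p->nchild = 0;
    p->flushed = 0;
}

static void fake_add_child(struct fake_priv *parent, struct fake_priv *c)
{
    assert(parent->nchild < MAX_CHILDREN);
    parent->child[parent->nchild++] = c;
}

/*
 * Hold our own vlan_mutex while recursing into each child, which then
 * takes its own vlan_mutex. Returns the number of interfaces flushed.
 */
static int fake_flush(struct fake_priv *p)
{
    int n = 1;

    pthread_mutex_lock(&p->vlan_mutex);
    for (size_t i = 0; i < p->nchild; i++)
        n += fake_flush(p->child[i]);
    pthread_mutex_unlock(&p->vlan_mutex);

    p->flushed = 1;
    return n;
}
```

With one parent and two children, fake_flush() on the parent completes
and never locks the same mutex instance twice. A checker that tracks
locks per class rather than per instance would still report the nested
acquisition, and in the kernel the usual way to annotate such nesting is
mutex_lock_nested() with a subclass.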

I also found the attached patch, which you submitted a long time ago.
I tried it and it fixed the warning. I wonder why the patch was not
accepted, anything wr

Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB

2013-05-23 Thread Jack Wang
On 05/21/2013 05:19 PM, Jack Wang wrote:
> On 05/21/2013 02:51 PM, Sebastian Riemer wrote:
>> On 17.05.2013 16:16, Jack Wang wrote:
>>> unable to handle kernel paging request
>>
>> Hi Jack,
>>
>> this should be related to the list corruption in IPoIB as list_del()
>> sets the LIST_POISON1 and LIST_POISON2 pointers.
>> Referencing these results in page faults according to the documentation
>> in the code.
>>
>> Cheers,
>> Sebastian
>>
> This bug is easily triggered with the inject_bug patch below, running
> iperf -P 50 while switching the IB mode in sync on both sides.
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -1315,7 +1315,8 @@ static void ipoib_cm_tx_start(struct work_struct *work)
>   netif_tx_lock_bh(dev);
>   spin_lock_irqsave(&priv->lock, flags);
> 
> - if (ret) {
> + if (ret || priv->inject_bug) {
> + priv->inject_bug = 0;
>   neigh = p->neigh;
>   if (neigh) {
>   neigh->cm = NULL;
> 
> After changing list_del to list_del_init it turned into a different
> panic; I'm trying to get the backtrace.
> 
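
For readers following the list_del versus list_del_init discussion
above, here is a user-space sketch of the difference. It is a simplified
re-implementation for illustration, not the kernel's list.h; the POISON1
and POISON2 values are placeholders and do not match the kernel's
LIST_POISON1 and LIST_POISON2 constants.

```c
#include <assert.h>
#include <stdint.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Placeholder poison values; the kernel's real constants differ. */
#define POISON1 ((struct list_head *)(uintptr_t)0x100)
#define POISON2 ((struct list_head *)(uintptr_t)0x200)

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Insert n right after head. */
static void list_add_head(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink e from its neighbours without touching e's own pointers. */
static void list_unlink(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* list_del style: leave poison behind so a later use faults loudly. */
static void list_del_poison(struct list_head *e)
{
    list_unlink(e);
    e->next = POISON1;
    e->prev = POISON2;
}

/* list_del_init style: leave e as a valid empty list. */
static void list_del_reinit(struct list_head *e)
{
    list_unlink(e);
    list_init(e);
}
```

After list_del_poison() any further walk or delete through the entry
dereferences a poison pointer. After list_del_reinit() a second delete
is harmless, which can hide a double delete and let an object lifetime
problem surface somewhere else instead.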

Here is a trace I got during testing. Dear IPoIB experts, could you give
some suggestions? It looks like an object lifetime issue.



May 21 15:12:03 ib2 kernel: [  415.050021] general protection fault:
 [#1] SMP
May 21 15:12:03 ib2 kernel: [  415.050114] CPU 2
May 21 15:12:03 ib2 kernel: [  415.050142] Modules linked in:
ib_ipoib(O) rdma_ucm rdma_cm iw_cm ib_addr ib_cm ib_sa ib_uverbs ib_umad
mlx4_ib ib_mad ib_core ip6table_filter ip6_tables iptable_filter
ip_tables ebtable_nat ebtables x_tables cpufreq_powersave
cpufreq_conservative cpufreq_stats cpufreq_userspace binfmt_misc fuse
loop kvm_amd kvm tpm_tis powernow_k8 shpchp tpm processor mperf
edac_core tpm_bios psmouse edac_mce_amd pci_hotplug microcode evdev
serio_raw i2c_piix4 asus_atk0110 thermal_sys button dm_multipath scsi_dh
mlx4_en sg sd_mod crc_t10dif r8169 ahci libahci libata scsi_mod
mlx4_core [last unloaded: ib_ipoib]
May 21 15:12:03 ib2 kernel: [  415.051845]
May 21 15:12:03 ib2 kernel: [  415.051886] Pid: 3166, comm: kworker/2:0
Tainted: G   O 3.4.23-pserver-hotfix+ #109 System manufacturer
System Product Name/M4A89GTD-PRO
May 21 15:12:03 ib2 kernel: [  415.052019] RIP:
0010:[]  [] ib_modify_qp+0x9/0x20
[ib_core]
May 21 15:12:03 ib2 kernel: [  415.052106] RSP: 0018:88020efd3b00
EFLAGS: 00010246
May 21 15:12:03 ib2 kernel: [  415.052148] RAX:  RBX:
 RCX: 
May 21 15:12:03 ib2 kernel: [  415.052190] RDX: 00129181 RSI:
88020efd3b20 RDI: dead4ead
May 21 15:12:03 ib2 kernel: [  415.052233] RBP: 88020efd3b00 R08:
 R09: 0001
May 21 15:12:03 ib2 kernel: [  415.052275] R10:  R11:
 R12: 8801fb698c60
May 21 15:12:03 ib2 kernel: [  415.052317] R13: 88020efd3b20 R14:
8802101fdc00 R15: 81e14250
May 21 15:12:03 ib2 kernel: [  415.052360] FS:  7f8c38a05700()
GS:88021fc8() knlGS:
May 21 15:12:03 ib2 kernel: [  415.052415] CS:  0010 DS:  ES: 
CR0: 8005003b
May 21 15:12:03 ib2 kernel: [  415.052457] CR2: 7f8c38535d70 CR3:
01c0b000 CR4: 07e0
May 21 15:12:03 ib2 kernel: [  415.052500] DR0:  DR1:
 DR2: 
May 21 15:12:03 ib2 kernel: [  415.052542] DR3:  DR6:
0ff0 DR7: 0400
May 21 15:12:03 ib2 kernel: [  415.052585] Process kworker/2:0 (pid:
3166, threadinfo 88020efd2000, task 88021228bf00)
May 21 15:12:03 ib2 kernel: [  415.052640] Stack:
May 21 15:12:03 ib2 kernel: [  415.052678]  88020efd3c40
a02bfcb9  001291811228bf00
May 21 15:12:03 ib2 kernel: [  415.052834]  0002
88020005 8173c557 0008005eefed5918
May 21 15:12:03 ib2 kernel: [  415.052988]  81e12e00
0080 88020efd3b70 
May 21 15:12:03 ib2 kernel: [  415.053143] Call Trace:
May 21 15:12:03 ib2 kernel: [  415.053188]  []
ipoib_cm_rep_handler+0x99/0x2c0 [ib_ipoib]
May 21 15:12:03 ib2 kernel: [  415.053233]  [] ?
trace_hardirqs_off+0xd/0x10
May 21 15:12:03 ib2 kernel: [  415.053277]  [] ?
_raw_spin_unlock_irqrestore+0x77/0x80
May 21 15:12:03 ib2 kernel: [  415.053322]  [] ?
__queue_work+0x103/0x4a0
May 21 15:12:03 ib2 kernel: [  415.053364]  [] ?
trace_hardirqs_off_caller+0x29/0xd0
May 21 15:12:03 ib2 kernel: [  415.053409]  []
ipoib_cm_tx_handler+0x93/0x2b0 [ib_ipoib]
May 21 15:12:03 ib2 kernel: [  415.053452]  [] ?
trace_hardirqs_off+0xd/0x10
May 21 15:12:03 ib2 kernel: [  415.053497]  []
cm_process_work+0x25/0x120 [ib_cm]
May 21 15:12:03 ib2 kernel: [  415.053540]  []
cm_rep_handler+0x308/0x590 [ib_cm]
May 21 15:12:03 ib2 kernel: [  415.053585]  []
cm_work_handler+0x145/0x1070 [ib_cm]
May 21 15:12:03 ib2 kernel: [  415.0536

Re: mlx4/xrc problem

2013-05-23 Thread Steve Wise


On 5/22/2013 12:38 PM, Steve Wise wrote:

On 5/22/2013 11:39 AM, Hefty, Sean wrote:
[root@hpc-hn1 libibverbs-1.1.4]# ibv_xsrq_pingpong -d mlx4_0 192.168.174.52
 local: LID 0001, QPN RECV 98004b SEND 18004c, PSN 5b6d99, SRQN 0042
remote: LID 0002, QPN RECV d4004a SEND 54004b, PSN d7ba7a, SRQN 0042

The same SRQN on both sides looks suspicious.

I confirmed via gdb that 0x42 is in ctx.srq->xrc_srq_num on both sides.


[root@hpc-cn2 libibverbs-1.1.4]# ibv_xsrq_pingpong  -d mlx4_0
remote: LID 0001, QPN RECV 98004b SEND 18004c, PSN 5b6d99, SRQN 0042
 local: LID 0002, QPN RECV d4004a SEND 54004b, PSN d7ba7a, SRQN 0042




I found the problem.  I was creating the initiator QP as type RC instead 
of XRC.  I didn't think the initiator side needed a different type.


Steve


Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB

2013-05-23 Thread Doug Ledford
On 05/23/2013 11:38 AM, Jack Wang wrote:
> Tainted: G   O 3.4.23-pserver-hotfix+ #109 System manufacturer
 ^^^

I would try a newer kernel.  There are a couple known issues fixed since
this kernel (including a memory corrupter that was involved with
neighbor list handling, and some of your traces look vaguely familiar to
that old failure).



-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford






Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB

2013-05-23 Thread Jack Wang
On 2013-05-23 19:41, Doug Ledford wrote:
> On 05/23/2013 11:38 AM, Jack Wang wrote:
>> Tainted: G   O 3.4.23-pserver-hotfix+ #109 System manufacturer
>  ^^^
> 
> I would try a newer kernel.  There are a couple known issues fixed since
> this kernel (including a memory corrupter that was involved with
> neighbor list handling, and some of your traces look vaguely familiar to
> that old failure).
> 
> 
> 

Thanks for the reply, Doug. I tried the rdma-for-linus branch; it panics
in other places.

Could you point me to the exact commit you mean?

Regards,
Jack


Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB

2013-05-23 Thread Doug Ledford
On 05/23/2013 02:53 PM, Jack Wang wrote:
> On 2013-05-23 19:41, Doug Ledford wrote:
>> On 05/23/2013 11:38 AM, Jack Wang wrote:
>>> Tainted: G   O 3.4.23-pserver-hotfix+ #109 System manufacturer
>>  ^^^
>>
>> I would try a newer kernel.  There are a couple known issues fixed since
>> this kernel (including a memory corrupter that was involved with
>> neighbor list handling, and some of your traces look vaguely familiar to
>> that old failure).
>>
>>
>>
> 
> Thanks for the reply, Doug. I tried the rdma-for-linus branch; it panics
> in other places.
> 
> Could you point me to the exact commit you mean?
> 
> Regards,
> Jack
> 

Just try the official v3.9 kernel from Linus and see how it does.  A
'git checkout v3.9' will do the trick.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford



