AW: IPoIB GRO
> That's why the flush flag is always set and the GRO stack does not work at all. I'm willing to dig deeper into this, but I'm unsure whether those fields are filled in on the sender or the receiver side, and especially where in the IPoIB stack.

I may have found the reason for the strange ACK behaviour during large NFS-over-IPoIB reads, and hopefully someone can confirm this.

If I turn on TSO for an IPoIB datagram interface on the sender side, GRO on the receiver side is completely broken. This is because TSO generates large packets of about 60k that are offloaded into fragments. Each of these fragments carries the same ID in its IP header. GRO expects the IDs to be in incremental order and issues a flush after each packet. Each flush results in an ACK packet back to the server.

With TSO disabled, GRO can kick in. Packets are built with sequential IDs, and an ACK is only sent every few packets.

For a fully cached file read of 6 GB the numbers are:

    TSO on:  ~220 MByte/s - 1,522,679 MLX4 interrupts on server
    TSO off: ~550 MByte/s -   318,322 MLX4 interrupts on server

Is there any chance IPoIB TSO handling can be optimized?

Markus

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. E-mails sent over the internet may have been written under a wrong name or been manipulated. That is why this message sent as an e-mail is not a legally binding declaration of intention.

Collogia Unternehmensberatung AG
Ubierring 11
D-50678 Köln
Executive board: Kadir Akin, Dr. Michael Höhnerbach
President of the supervisory board: Hans Kristian Langva
Registry office: district court Cologne
Register number: HRB 52 497
Re: IPoIB GRO
Hi Markus,

As Or already mentioned, it seems that IP packets do get aggregated when GRO is enabled on the ib interface. In tcpdump on the receive side we can see:

    10:09:27.336951 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3795959253:3796023381, ack 2, win 110, length 64128
    10:09:27.336987 IP 11.134.41.1.35957 > 11.134.33.1.41377: Flags [.], ack 3796023381, win 2036, length 0
    10:09:27.337022 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3796023381:3796087509, ack 2, win 110, length 64128
    10:09:27.337044 IP 11.134.41.1.35957 > 11.134.33.1.41377: Flags [.], ack 3796087509, win 3038, length 0
    10:09:27.337083 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3796087509:3796151637, ack 2, win 110, length 64128
    10:09:27.337107 IP 11.134.41.1.35957 > 11.134.33.1.41377: Flags [.], ack 3796151637, win 4040, length 0
    10:09:27.337142 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3796151637:3796215765, ack 2, win 110, length 64128
    .

Don't you see that behaviour in tcpdump? What kernel are you using?

I will take a look at the GRO code and at our code to check whether we missed something, and will update.

Thanks,
Erez

> Hello,
>
> I have a small update on the unfortunate GRO behaviour over IPoIB that I have observed over the last few weeks in datagram mode on our ConnectX cards.
>
> In the GRO receive path the kernel steps into the inet_gro_receive() function in net/ipv4/af_inet.c. If I read the code correctly, it compares two IP packets and decides whether they belong to the same flow. Further checks in some subroutines narrow the comparison down to IPv4 and so on. I put a debugging message next to the following comparison, which seems to be the culprit:
>
>     inet_gro_receive()
>     ...
>         /* All fields must match except length and checksum. */
>         NAPI_GRO_CB(p)->flush |=
>             (iph->ttl ^ iph2->ttl) |
>             (iph->tos ^ iph2->tos) |
>             (__force int)((iph->frag_off ^ iph2->frag_off) & htons(IP_DF)) |
>             ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
>
>         /* Do some debug */
>         printk("%i %i %i\n", ntohs(iph2->id), NAPI_GRO_CB(p)->count, id);
>     ...
>
> On a normal GBit Intel card the kernel output reads:
>
>     32933 12 32945
>     32933 13 32946
>     32946 1 32947
>     32946 2 32948
>     ...
>     32946 15 32961
>     32964 3 32967
>     32964 4 32968
>     ...
>
> My interpretation is that the ID of an incoming packet must equal the ID of the initial packet plus its count field. Only then do we have a GRO candidate.
>
> On our ib0 interface the count field of a received packet seems to be 1 most of the time, and the packet ID almost always equals the initial packet ID:
>
>     35754 1 35754
>     35754 1 35754
>     35754 1 35754
>     ...
>     35754 1 35786
>     35786 1 35786
>     35786 1 35786
>     ...
>
> That's why the flush flag is always set and the GRO stack does not work at all. I'm willing to dig deeper into this, but I'm unsure whether those fields are filled in on the sender or the receiver side, and especially where in the IPoIB stack.
>
> Maybe someone can point me in the right direction so that I can dig deeper and provide some more information.
>
> Best regards,
> Markus

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
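The IP ID term of the comparison quoted in this message can be tried out in user space. The following is only a sketch: ip_id_flush() and check_thread_values() are hypothetical names, not kernel functions, and the sketch models just the ID term, not the TTL, TOS or DF checks. The values fed in are the ones from the debug output above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space model of the IP ID term in the quoted comparison.
 * held_id: IP ID of the packet GRO already holds (host byte order)
 * count:   value of the count field printed in the debug output
 * new_id:  IP ID of the packet that just arrived (host byte order)
 * Returns 0 if new_id is the expected next ID, non-zero ("flush")
 * otherwise. The 16-bit cast models wrap-around of the ID field.
 */
static unsigned int ip_id_flush(uint16_t held_id, unsigned int count,
                                uint16_t new_id)
{
    return (unsigned int)((uint16_t)(held_id + count) ^ new_id);
}

/* Replays two lines of the debug output from the thread. */
static void check_thread_values(void)
{
    /* Intel GBit card: 32933 + 12 == 32945, so no flush. */
    assert(ip_id_flush(32933, 12, 32945) == 0);
    /* ib0: 35754 + 1 != 35754, so the flush bit is set. */
    assert(ip_id_flush(35754, 1, 35754) != 0);
}
```

Calling check_thread_values() from a small main() passes both assertions, which matches the observation that the Intel trace merges while the ib0 trace flushes on every packet.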
AW: IPoIB GRO
Hi Erez,

> don't you see that behaviour in tcpdump? what kernel are you using?

On the server side we have a 3.5 kernel and on the client side a 3.11 kernel, each with the standard in-kernel drivers/modules. I can see the same pattern of GRO aggregation on the client that you mention, but only if I disable TSO for ib0 on the server side.

The test I'm running on the client looks like this. The second and third read runs are definitely served from the NFS server-side cache.

    sysctl -w net.ipv4.tcp_mem="4096 65536 4194304"
    sysctl -w net.ipv4.tcp_rmem="4096 65536 4194304"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
    sysctl -w net.core.rmem_max=8388608
    sysctl -w net.core.wmem_max=8388608
    mount -o nfsvers=3,rsize=262144,wsize=262144 10.10.30.251:/export /mnt
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
    umount /mnt

Markus
Re: AW: IPoIB GRO
Hi Markus,

Can you please tell me which FW version you have on your ConnectX cards?

Thanks,
Erez
[PATCH opensm] When exiting, update SADB only in MASTER state
From: Vladimir Koushnir <vladim...@mellanox.com>

Signed-off-by: Vladimir Koushnir <vladim...@mellanox.com>
Signed-off-by: Hal Rosenstock <h...@mellanox.com>
---
 opensm/osm_opensm.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_opensm.c b/opensm/osm_opensm.c
index b3c487f..f2e04f6 100644
--- a/opensm/osm_opensm.c
+++ b/opensm/osm_opensm.c
@@ -132,7 +132,8 @@ void osm_opensm_destroy(IN osm_opensm_t * p_osm)
 	cl_disp_shutdown(p_osm->sminfo_get_disp);
 
 	/* dump SA DB */
-	if (p_osm->subn.opt.sa_db_dump)
+	if ((p_osm->sm.p_subn->sm_state == IB_SMINFO_STATE_MASTER) &&
+	    p_osm->subn.opt.sa_db_dump)
 		osm_sa_db_file_dump(p_osm);
 
 	/* do the destruction in reverse order as init */
-- 
1.7.8.2
[PATCH opensm] Fix timeout handling for pkeyGet for sw port 0
From: Dan Ben Yosef <da...@mellanox.com>

Remove node in addition to port when getting timeout for pkeyGet for sw port 0.

Signed-off-by: Dan Ben Yosef <da...@mellanox.com>
Signed-off-by: Hal Rosenstock <h...@mellanox.com>
---
diff --git a/opensm/osm_drop_mgr.c b/opensm/osm_drop_mgr.c
index 85a6f58..11c1561 100644
--- a/opensm/osm_drop_mgr.c
+++ b/opensm/osm_drop_mgr.c
@@ -535,7 +535,15 @@ void osm_drop_mgr_process(osm_sm_t * sm)
 			drop_mgr_process_node(sm, p_node);
 		else {
 			/*
-			 * Drop port if there was timeout for GetPKeyTable
+			 * We want to preserve the configured pkey indexes,
+			 * so if we don't receive GetResp P_KeyTable for some block,
+			 * do the following:
+			 *   1. Drop node if the node is sw and got timeout for port 0.
+			 *   2. Drop node if node is HCA/RTR.
+			 *   3. Drop only physp if got timeout for sw when the port isn't 0.
+			 * We'll set error during initialization in order to
+			 * cause an immediate heavy sweep and try to get the
+			 * configured P_KeyTable again.
 			 */
 			if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH)
 				port_num = 0;
@@ -547,12 +555,12 @@ void osm_drop_mgr_process(osm_sm_t * sm)
 				if (!p_physp || p_physp->pkeys.rcv_blocks_cnt == 0)
 					continue;
 				sm->p_subn->subnet_initialization_error = TRUE;
-				if (!port_num || osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) {
-					port_guid = osm_physp_get_port_guid(p_physp);
-					p_port = osm_get_port_by_guid(sm->p_subn, port_guid);
-					p_port->discovery_count = 0;
-				} else
+				port_guid = osm_physp_get_port_guid(p_physp);
+				p_port = osm_get_port_by_guid(sm->p_subn, port_guid);
+				if (p_node->physp_discovered[port_num]) {
 					p_node->physp_discovered[port_num] = 0;
+					p_port->discovery_count--;
+				}
 			}
 		}
 	}
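The three-case rule in the new comment of this patch can be restated as a small decision function. This is only a sketch of the policy as the comment words it, not of the patched opensm code: the enum names and pkey_timeout_action() are made up for illustration and do not exist in opensm.

```c
/* Hypothetical stand-ins for the opensm node types. */
enum node_type { NODE_CA, NODE_SWITCH, NODE_ROUTER };

/* What to drop after a GetPKeyTable timeout. */
enum drop_action { DROP_NODE, DROP_PHYSP_ONLY };

/*
 * Policy from the patch comment:
 *   1. switch, timeout on port 0      -> drop the node
 *   2. HCA or router                  -> drop the node
 *   3. switch, timeout on other ports -> drop only that physical port
 */
static enum drop_action pkey_timeout_action(enum node_type type,
                                            unsigned int port_num)
{
    if (type != NODE_SWITCH)
        return DROP_NODE;       /* case 2 */
    if (port_num == 0)
        return DROP_NODE;       /* case 1 */
    return DROP_PHYSP_ONLY;     /* case 3 */
}
```

For example, pkey_timeout_action(NODE_SWITCH, 5) yields DROP_PHYSP_ONLY, while the same timeout on port 0 of that switch yields DROP_NODE.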
[PATCH librdmacm] Makefile.am: Add missing riostream man page to man_MANS
Signed-off-by: Hal Rosenstock <h...@mellanox.com>
---
diff --git a/Makefile.am b/Makefile.am
index 4e3dee7..bf72134 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -125,6 +125,7 @@ man_MANS = \
 	man/rdma_client.1 \
 	man/rdma_xserver.1 \
 	man/rdma_xclient.1 \
+	man/riostream.1 \
 	man/rstream.1 \
 	man/rcopy.1 \
 	man/rdma_cm.7 \
AW: AW: IPoIB GRO
> Hi Markus,
> Can you please tell me what is the FW version you have on your ConnectX cards?

Of course. The server has:

    root@client:~# ibstat
    CA 'mlx4_0'
        CA type: MT26418
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: a0
        Node GUID: 0x0002c903000ec11a
        System image GUID: 0x0002c903000ec11d
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 20
            Base lid: 4
            LMC: 0
            SM lid: 2
            Capability mask: 0x02510868
            Port GUID: 0x0002c903000ec11b

The client has an older 2.7.x firmware, mostly because of the X58 chipset's incompatibility with newer firmware versions. Your question suggests that this behaviour may be related to the older firmware, so I moved the client-side test to another host with newer firmware. Nevertheless, the TSO problem occurs there too.

    root@client:~# ibstat
    CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: a0
        Node GUID: 0x001e0b4cf9c4
        System image GUID: 0x001e0b4cf9c7
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 20
            Base lid: 14
            LMC: 0
            SM lid: 2
            Capability mask: 0x02510868
            Port GUID: 0x001e0b4cf9c5
        Port 2:
            State: Down
            Physical state: Polling
            Rate: 10
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x02510868
            Port GUID: 0x001e0b4cf9c6

Best regards and thanks in advance.

Markus
RE: linux-next: build warning after merge of the infiniband tree
This issue was caught by Tetsuo Handa and Acked on 10/30:

http://marc.info/?t=13831336458r=1w=2.

Roland,

I noticed that Tetsuo's original message didn't cc the linux-rdma list?

Mike

-----Original Message-----
From: Stephen Rothwell [mailto:s...@canb.auug.org.au]
Sent: Sunday, November 03, 2013 11:55 PM
To: Roland Dreier; linux-rdma@vger.kernel.org
Cc: linux-n...@vger.kernel.org; linux-ker...@vger.kernel.org; Jan Kara; Marciniszyn, Mike
Subject: linux-next: build warning after merge of the infiniband tree

Hi all,

After merging the infiniband tree, today's linux-next build (x86_64 allmodconfig) produced this warning:

drivers/infiniband/hw/ipath/ipath_user_sdma.c: In function 'ipath_user_sdma_pin_pages':
drivers/infiniband/hw/ipath/ipath_user_sdma.c:283:6: warning: 'j' is used uninitialized in this function [-Wuninitialized]
  ret = get_user_pages_fast(addr, j, 0, pages);
      ^

Introduced by commit 18fec3c6bdcb ("IB/ipath: Convert ipath_user_sdma_pin_pages() to use get_user_pages_fast()"). How did that pass review or testing?

--
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au
Re: [PATCH RFC v2 00/10] Introduce Signature feature
On Sat, 2013-11-02 at 14:57 -0700, Bart Van Assche wrote:
> On 1/11/2013 18:36, Nicholas A. Bellinger wrote:
> > On Fri, 2013-11-01 at 08:03 -0700, Bart Van Assche wrote:
> > > On 31/10/2013 5:24, Sagi Grimberg wrote:
> > > > In T10-DIF, when a series of 512-byte data blocks are transferred,
> > > > each block is followed by an 8-byte guard. The guard consists of CRC
> > > > that protects the integrity of the data in the block, and some other
> > > > tags that protects against mis-directed IOs.
> > >
> > > Shouldn't that read logical block length divided by 2**(protection
> > > interval exponent) instead of 512 ? From the SPC-4 FORMAT UNIT section:
> >
> > Why should the protection interval in FORMAT_UNIT be mentioned when it's
> > not supported by the hardware, nor by drivers/scsi/sd_dif.c itself..?
>
> Hello Nick,
>
> My understanding is that this patch series is not only intended for
> initiator drivers but also for target drivers like ib_srpt and ib_isert.
> As you know target drivers do not restrict the initiator operating system
> to Linux. Although I do not know whether there are already operating
> systems that support the protection interval exponent,

It's my understanding that Linux is still the only stack that supports DIF, so AFAICT no one is actually supporting this.

> I think it is a good idea to stay as close as possible to the terminology
> of the SPC-4 standard.

No, in this context it only adds pointless misdirection because 1) The hardware in question doesn't support it, and 2) Linux itself doesn't support it.

--nab
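For readers unfamiliar with the layout under discussion, the 8 bytes that follow each 512-byte block in T10-DIF can be sketched as a C struct. This is an illustration only: the struct and helper names are hypothetical, the fields are big-endian on the wire, and the sketch assumes the fixed 512-byte interval from the quoted description rather than a protection interval exponent.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 8-byte T10-DIF protection information appended to each data block.
 * All three fields are stored big-endian on the wire; this sketch only
 * models their sizes and order, not byte-order conversion.
 */
struct dif_tuple {
    uint16_t guard_tag; /* CRC-16 over the data block */
    uint16_t app_tag;   /* application tag */
    uint32_t ref_tag;   /* reference tag, guards against misdirected I/O */
};

/* Bytes transferred for nblocks data blocks, each followed by one tuple. */
static size_t dif_wire_bytes(size_t nblocks, size_t block_size)
{
    return nblocks * (block_size + sizeof(struct dif_tuple));
}
```

With these definitions, eight 512-byte blocks occupy dif_wire_bytes(8, 512), that is 8 * 520 = 4160 bytes, instead of 4096.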
Re: AW: IPoIB GRO
I looked at the TSO code earlier this year. IIRC, if TSO is on, the upper layer (e.g. IP) just sends the super-packet down (to IPoIB) without segmentation (for send); if it is off, the upper layer does the segmentation (to match the MTU size) before calling the device's send.

For GRO, I would imagine it needs some sort of segmentation sequence to know how to pull the segments together on the receive end. It looks to me as if segmentation offload (TSO) and receive offload (GRO) are mutually exclusive?

Check out dev_gro_receive() (line numbers based on the 2.6.32 RHEL kernel):

    2980
    2981         if (skb_is_gso(skb) || skb_has_frags(skb))
    2982                 goto normal;

See how it bails out when TSO (skb_is_gso()) is on?

So it looks like an IPoIB bug that ipoib_ib_handle_rx_wc() does an unconditional napi_gro_receive() regardless of adapter capability (and TSO setting).

Just a guess!

-- Wendy