AW: IPoIB GRO

2013-11-04 Thread Markus Stockhausen
> That's why the flush flag is always set and the GRO stack does
> not work at all. I'm willing to dig deeper into this but I'm unsure
> if those fields are filled on sender or receiver side and especially
> where in the IPoIB stack.

I may have found the reason for the strange ACK behaviour during
large NFS reads over IPoIB, and hopefully someone can confirm it.

If I turn on TSO for an IPoIB datagram interface on the sender side,
GRO on the receiver side is totally broken. This is due to the fact that
TSO generates large 60k packets that are offloaded into
fragments. Each of these fragments has the same ID in the IP
header. GRO expects the IDs to be in incremental order and issues a
flush after each packet. Each flush results in an ACK packet back
to the server.

With TSO disabled, GRO can kick in. Packets are built with
sequential IDs, and the receiver only sends an ACK every few packets.
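
The effect described above can be illustrated with a small stand-alone model. This is only a sketch, not kernel code: the function name is made up for illustration, and it assumes that a segment is merged only while its IPv4 ID continues the sequence of the packet currently held by GRO. Size limits, timers and the TCP-level checks are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many packets reach the TCP layer for one flow when segments are
 * merged only while their IPv4 ID continues the held packet's ID sequence.
 * Simplified model: size limits, timers and TCP-level checks are ignored.
 */
size_t gro_delivered_packets(const uint16_t *ids, size_t n)
{
    size_t delivered = 0;
    uint16_t first_id = 0;   /* ID of the first segment in the held packet */
    uint16_t count = 0;      /* segments merged into the held packet so far */

    for (size_t i = 0; i < n; i++) {
        if (count != 0 && (uint16_t)(first_id + count) == ids[i]) {
            count++;         /* ID continues the sequence: merge */
            continue;
        }
        if (count != 0)
            delivered++;     /* mismatch: the held packet is flushed */
        first_id = ids[i];   /* the new segment becomes the held packet */
        count = 1;
    }
    if (count != 0)
        delivered++;         /* final flush at the end of the burst */
    return delivered;
}
```

In this model a burst of segments that all carry the same ID is delivered one by one, while a burst with sequential IDs is delivered as a single aggregated packet.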

For a fully cached file read of 6GB the numbers read:

TSO on: ~220MByte/s - 1,522,679 MLX4 Interrupts on server
TSO off: ~550MByte/s - 318,322 MLX4 Interrupts on server
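
To put the two measurements side by side, they can be expressed as ratios. The helper below is a hypothetical convenience function, not taken from any tool; it uses integer arithmetic only and returns the ratio in thousandths.

```c
#include <stdint.h>

/* Ratio a/b in thousandths, rounded to nearest. Returns 0 when b is 0. */
uint64_t ratio_milli(uint64_t a, uint64_t b)
{
    if (b == 0)
        return 0;
    return (a * 1000 + b / 2) / b;
}
```

With the numbers above, ratio_milli(550, 220) gives 2500 (2.5 times the throughput with TSO off) and ratio_milli(1522679, 318322) gives 4783 (roughly 4.8 times as many interrupts with TSO on).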

Is there any chance IPoIB TSO handling can be optimized?

Markus

Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail
irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und
vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte
Weitergabe dieser Mail ist nicht gestattet.

Über das Internet versandte E-Mails können unter fremden Namen erstellt oder
manipuliert werden. Deshalb ist diese als E-Mail verschickte Nachricht keine
rechtsverbindliche Willenserklärung.

Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln

Vorstand:
Kadir Akin
Dr. Michael Höhnerbach

Vorsitzender des Aufsichtsrates:
Hans Kristian Langva

Registergericht: Amtsgericht Köln
Registernummer: HRB 52 497

This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.

e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.

Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln

executive board:
Kadir Akin
Dr. Michael Höhnerbach

President of the supervisory board:
Hans Kristian Langva

Registry office: district court Cologne
Register number: HRB 52 497




Re: IPoIB GRO

2013-11-04 Thread Erez Shitrit

Hi Markus,

As Or already mentioned, we do see IP packets being accumulated
when GRO is enabled on the IB interface. In tcpdump on the
receive side we can see:


10:09:27.336951 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3795959253:3796023381, ack 2, win 110, length 64128
10:09:27.336987 IP 11.134.41.1.35957 > 11.134.33.1.41377: Flags [.], ack 3796023381, win 2036, length 0
10:09:27.337022 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3796023381:3796087509, ack 2, win 110, length 64128
10:09:27.337044 IP 11.134.41.1.35957 > 11.134.33.1.41377: Flags [.], ack 3796087509, win 3038, length 0
10:09:27.337083 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3796087509:3796151637, ack 2, win 110, length 64128
10:09:27.337107 IP 11.134.41.1.35957 > 11.134.33.1.41377: Flags [.], ack 3796151637, win 4040, length 0
10:09:27.337142 IP 11.134.33.1.41377 > 11.134.41.1.35957: Flags [.], seq 3796151637:3796215765, ack 2, win 110, length 64128

.
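
The length field in the dump follows directly from the seq range, and it can be turned into a segment count. The sketch below assumes an MSS of 2004 bytes (an IPoIB datagram MTU of 2044 minus 40 bytes of IPv4 and TCP headers, no TCP options); that value is an assumption and is not stated anywhere in this thread. The function names are made up for illustration.

```c
#include <stdint.h>

/*
 * Payload length covered by a tcpdump "seq start:end" range (end exclusive).
 * Unsigned 32-bit subtraction also handles sequence-number wraparound.
 */
uint32_t tcp_range_len(uint32_t seq_start, uint32_t seq_end)
{
    return seq_end - seq_start;
}

/* Number of MSS-sized segments needed to carry len bytes, rounded up. */
uint32_t segments_for(uint32_t len, uint32_t mss)
{
    if (mss == 0)
        return 0;
    return (len + mss - 1) / mss;
}
```

For the first line of the dump, tcp_range_len(3795959253u, 3796023381u) is 64128, and under the assumed MSS segments_for(64128, 2004) is 32, i.e. one aggregated packet would stand for 32 wire segments.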


Don't you see that behaviour in tcpdump? What kernel are you using?

I will take a look at the GRO code and our code to check whether we
missed something, and will update.


Thanks, Erez


Hello,

I have a little update on the unfortunate GRO IPoIB behaviour I observed
over the last weeks in datagram mode on our ConnectX cards. In the
GRO receive path the kernel steps into the inet_gro_receive() function
of net/ipv4/af_inet.c. If I read the code right, it compares two
IP packets and decides whether they belong to the same flow.
Further checks are included in some subroutines that narrow
down the comparison to IPv4 and so on.

I put a debugging message into the following comparison that
seems to be the culprit of it all.

inet_gro_receive()
   ...
   /* All fields must match except length and checksum. */
   NAPI_GRO_CB(p)->flush |=
       (iph->ttl ^ iph2->ttl) |
       (iph->tos ^ iph2->tos) |
       (__force int)((iph->frag_off ^ iph2->frag_off) & htons(IP_DF)) |
       ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
   /* Do some debug */
   printk("%i %i %i\n", ntohs(iph2->id), NAPI_GRO_CB(p)->count, id);
   ...

On a normal GBit Intel card the kernel output reads:

32933 12 32945
32933 13 32946
32946 1 32947
32946 2 32948
...
32946 15 32961
32964 3 32967
32964 4 32968
...

My interpretation is that the ID of a new packet must equal
the ID of the initial (held) packet plus its count field. Only then
do we have a GRO candidate.
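
Taken in isolation, the ID term of the flush expression quoted above can be written as a stand-alone predicate. This is a user-space sketch with a made-up function name, not the kernel code itself; all values are in host byte order.

```c
#include <stdint.h>

/*
 * Nonzero when a new packet's IPv4 ID does NOT continue the held packet's
 * sequence, mirroring the last term of the flush expression.
 * held_id: ID of the first segment of the held packet
 * count:   number of segments already merged into the held packet
 * new_id:  ID of the packet that just arrived
 */
unsigned int gro_id_mismatch(uint16_t held_id, uint16_t count, uint16_t new_id)
{
    return (unsigned int)((uint16_t)(held_id + count) ^ new_id);
}
```

Fed with the triples printed by the debug message, a line such as "32933 12 32945" yields 0 (mergeable), while "35754 1 35754" yields a nonzero value (flush).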

On our ib0 interface the count field of a received packet seems
to be 1 most of the time and the packet id always matches the
initial packet id:

35754 1 35754
35754 1 35754
35754 1 35754
...
35754 1 35786
35786 1 35786
35786 1 35786
...

That's why the flush flag is always set and the GRO stack does
not work at all. I'm willing to dig deeper into this, but I'm unsure
whether those fields are filled in on the sender or the receiver side, and
especially where in the IPoIB stack. Maybe someone can point me in the
right direction so that I can provide some more
information.

Best regards,

Markus



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


AW: IPoIB GRO

2013-11-04 Thread Markus Stockhausen
Hi Erez,

> don't you see that behaviour in tcpdump? what kernel are you using?

On the server side we have a 3.5 kernel, on the client side a 3.11 kernel, each
with the standard in-kernel drivers/modules. I can see the same pattern of GRO
aggregation on the client that you mention, but only if I disable TSO for
ib0 on the server side.

The test I'm running on the client looks like this. The second and third read
runs are definitely served from the NFS server-side cache.

sysctl -w net.ipv4.tcp_mem="4096 65536 4194304"
sysctl -w net.ipv4.tcp_rmem="4096 65536 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608

mount -o nfsvers=3,rsize=262144,wsize=262144 10.10.30.251:/export /mnt
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
umount /mnt

Markus





Re: AW: IPoIB GRO

2013-11-04 Thread Erez Shitrit

Hi Markus,

Can you please tell me what is the FW version you have on your ConnectX 
cards?


Thanks, Erez






[PATCH opensm] When exiting, update SADB only in MASTER state

2013-11-04 Thread Hal Rosenstock

From: Vladimir Koushnir <vladim...@mellanox.com>

Signed-off-by: Vladimir Koushnir <vladim...@mellanox.com>
Signed-off-by: Hal Rosenstock <h...@mellanox.com>
---
 opensm/osm_opensm.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_opensm.c b/opensm/osm_opensm.c
index b3c487f..f2e04f6 100644
--- a/opensm/osm_opensm.c
+++ b/opensm/osm_opensm.c
@@ -132,7 +132,8 @@ void osm_opensm_destroy(IN osm_opensm_t * p_osm)
 	cl_disp_shutdown(p_osm->sminfo_get_disp);
 
 	/* dump SA DB */
-	if (p_osm->subn.opt.sa_db_dump)
+	if ((p_osm->sm.p_subn->sm_state == IB_SMINFO_STATE_MASTER) &&
+	     p_osm->subn.opt.sa_db_dump)
 		osm_sa_db_file_dump(p_osm);
 
 	/* do the destruction in reverse order as init */
-- 
1.7.8.2



[PATCH opensm] Fix timeout handling for pkeyGet for sw port 0

2013-11-04 Thread Hal Rosenstock

From: Dan Ben Yosef <da...@mellanox.com>

Remove node in addition to port when getting timeout for
pkeyGet for sw port 0.

Signed-off-by: Dan Ben Yosef <da...@mellanox.com>
Signed-off-by: Hal Rosenstock <h...@mellanox.com>
---
diff --git a/opensm/osm_drop_mgr.c b/opensm/osm_drop_mgr.c
index 85a6f58..11c1561 100644
--- a/opensm/osm_drop_mgr.c
+++ b/opensm/osm_drop_mgr.c
@@ -535,7 +535,15 @@ void osm_drop_mgr_process(osm_sm_t * sm)
 			drop_mgr_process_node(sm, p_node);
 		else {
 			/*
-			 * Drop port if there was timeout for GetPKeyTable
+			 * We want to preserve the configured pkey indexes,
+			 * so if we don't receive GetResp P_KeyTable for some block,
+			 * do the following:
+			 *   1. Drop node if the node is sw and got timeout for port 0.
+			 *   2. Drop node if node is HCA/RTR.
+			 *   3. Drop only physp if got timeout for sw when the port isn't 0.
+			 * We'll set error during initialization in order to
+			 * cause an immediate heavy sweep and try to get the
+			 * configured P_KeyTable again.
 			 */
 			if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH)
 				port_num = 0;
@@ -547,12 +555,12 @@ void osm_drop_mgr_process(osm_sm_t * sm)
 				if (!p_physp || p_physp->pkeys.rcv_blocks_cnt == 0)
 					continue;
 				sm->p_subn->subnet_initialization_error = TRUE;
-				if (!port_num || osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) {
-					port_guid = osm_physp_get_port_guid(p_physp);
-					p_port = osm_get_port_by_guid(sm->p_subn, port_guid);
-					p_port->discovery_count = 0;
-				} else
+				port_guid = osm_physp_get_port_guid(p_physp);
+				p_port = osm_get_port_by_guid(sm->p_subn, port_guid);
+				if (p_node->physp_discovered[port_num]) {
 					p_node->physp_discovered[port_num] = 0;
+					p_port->discovery_count--;
+				}
 			}
 		}
 	}
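
The three rules listed in the new comment can be restated as a small decision function. This is only a sketch of the stated policy, with hypothetical enum and function names that are not OpenSM types; the patch itself achieves the effect by adjusting physp_discovered[] and discovery_count rather than by calling anything like this.

```c
/* Hypothetical types for illustration only, not the OpenSM definitions. */
enum node_type { NODE_CA, NODE_SWITCH, NODE_ROUTER };
enum drop_action { DROP_NODE, DROP_PHYSP_ONLY };

/*
 * What to drop after a P_KeyTable Get timeout, following the comment:
 *   1. switch, port 0      -> drop the node
 *   2. HCA or router       -> drop the node
 *   3. switch, other port  -> drop only the physical port
 */
enum drop_action pkey_timeout_action(enum node_type type, unsigned int port_num)
{
    if (type != NODE_SWITCH)
        return DROP_NODE;                      /* rule 2 */
    if (port_num == 0)
        return DROP_NODE;                      /* rule 1 */
    return DROP_PHYSP_ONLY;                    /* rule 3 */
}
```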



[PATCH librdmacm] Makefile.am: Add missing riostream man page to man_MANS

2013-11-04 Thread Hal Rosenstock

Signed-off-by: Hal Rosenstock <h...@mellanox.com>
---
diff --git a/Makefile.am b/Makefile.am
index 4e3dee7..bf72134 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -125,6 +125,7 @@ man_MANS = \
 	man/rdma_client.1 \
 	man/rdma_xserver.1 \
 	man/rdma_xclient.1 \
+	man/riostream.1 \
 	man/rstream.1 \
 	man/rcopy.1 \
 	man/rdma_cm.7 \


AW: AW: IPoIB GRO

2013-11-04 Thread Markus Stockhausen
> Hi Markus,
>
> Can you please tell me what is the FW version you have on your ConnectX
> cards?

Of course. The server has:

root@client:~# ibstat
CA 'mlx4_0'
	CA type: MT26418
	Number of ports: 1
	Firmware version: 2.9.1000
	Hardware version: a0
	Node GUID: 0x0002c903000ec11a
	System image GUID: 0x0002c903000ec11d
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 20
		Base lid: 4
		LMC: 0
		SM lid: 2
		Capability mask: 0x02510868
		Port GUID: 0x0002c903000ec11b

The client has an older 2.7.x firmware, mostly because of
the X58 chipset incompatibility with newer firmware. Your
question suggests that this behaviour may be related to the
older firmware, so I moved the client side of the test to another
host with newer firmware. Nevertheless, the TSO problem
occurs there too.

root@client:~# ibstat
CA 'mlx4_0'
	CA type: MT25418
	Number of ports: 2
	Firmware version: 2.9.1000
	Hardware version: a0
	Node GUID: 0x001e0b4cf9c4
	System image GUID: 0x001e0b4cf9c7
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 20
		Base lid: 14
		LMC: 0
		SM lid: 2
		Capability mask: 0x02510868
		Port GUID: 0x001e0b4cf9c5
	Port 2:
		State: Down
		Physical state: Polling
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x02510868
		Port GUID: 0x001e0b4cf9c6

Best regards & thanks in advance.

Markus





RE: linux-next: build warning after merge of the infiniband tree

2013-11-04 Thread Marciniszyn, Mike
This issue was caught by Tetsuo Handa and Acked on 10/30:
http://marc.info/?t=13831336458r=1w=2.

Roland, I noticed that Tetsuo's original message didn't cc the linux-rdma
list.

Mike

> -----Original Message-----
> From: Stephen Rothwell [mailto:s...@canb.auug.org.au]
> Sent: Sunday, November 03, 2013 11:55 PM
> To: Roland Dreier; linux-rdma@vger.kernel.org
> Cc: linux-n...@vger.kernel.org; linux-ker...@vger.kernel.org; Jan Kara;
> Marciniszyn, Mike
> Subject: linux-next: build warning after merge of the infiniband tree
>
> Hi all,
>
> After merging the infiniband tree, today's linux-next build (x86_64
> allmodconfig) produced this warning:
>
> drivers/infiniband/hw/ipath/ipath_user_sdma.c: In function
> 'ipath_user_sdma_pin_pages':
> drivers/infiniband/hw/ipath/ipath_user_sdma.c:283:6: warning: 'j' is used
> uninitialized in this function [-Wuninitialized]
>    ret = get_user_pages_fast(addr, j, 0, pages);
>        ^
>
> Introduced by commit 18fec3c6bdcb ("IB/ipath: Convert
> ipath_user_sdma_pin_pages() to use get_user_pages_fast()").  How did that
> pass review or testing?
>
> --
> Cheers,
> Stephen Rothwell                    s...@canb.auug.org.au


Re: [PATCH RFC v2 00/10] Introduce Signature feature

2013-11-04 Thread Nicholas A. Bellinger
On Sat, 2013-11-02 at 14:57 -0700, Bart Van Assche wrote:
> On 1/11/2013 18:36, Nicholas A. Bellinger wrote:
> > On Fri, 2013-11-01 at 08:03 -0700, Bart Van Assche wrote:
> > > On 31/10/2013 5:24, Sagi Grimberg wrote:
> > > > In T10-DIF, when a series of 512-byte data blocks are transferred, each
> > > > block is followed by an 8-byte guard. The guard consists of CRC that
> > > > protects the integrity of the data in the block, and some other tags
> > > > that protects against mis-directed IOs.
> > >
> > > Shouldn't that read "logical block length divided by 2**(protection
> > > interval exponent)" instead of "512"? From the SPC-4 FORMAT UNIT
> > > section:
> >
> > Why should the protection interval in FORMAT_UNIT be mentioned when it's
> > not supported by the hardware, nor by drivers/scsi/sd_dif.c itself..?
>
> Hello Nick,
>
> My understanding is that this patch series is not only intended for
> initiator drivers but also for target drivers like ib_srpt and ib_isert.
> As you know target drivers do not restrict the initiator operating
> system to Linux. Although I do not know whether there are already
> operating systems that support the protection interval exponent,

It's my understanding that Linux is still the only stack that supports
DIF, so AFAICT no one is actually supporting this.

> I think it is a good idea to stay as close as possible to the terminology
> of the SPC-4 standard.
>

No, in this context it only adds pointless misdirection because 1) The
hardware in question doesn't support it, and 2) Linux itself doesn't
support it.

--nab
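
For reference, the sizes under discussion can be written down as a short sketch. It assumes the usual 8-byte T10-DIF tuple layout (2-byte guard tag, 2-byte application tag, 4-byte reference tag, big-endian on the wire); the struct and function names are made up for illustration and are not taken from the patch series.

```c
#include <stdint.h>

/* 8-byte protection tuple that follows each protected interval of data. */
struct dif_tuple {
    uint16_t guard_tag;  /* CRC over the data of the interval */
    uint16_t app_tag;    /* application tag */
    uint32_t ref_tag;    /* reference tag, guards against misdirected I/O */
};

/* Bytes transferred for nblocks blocks of block_len bytes, one tuple each. */
uint64_t dif_wire_bytes(uint64_t nblocks, uint32_t block_len)
{
    return nblocks * ((uint64_t)block_len + sizeof(struct dif_tuple));
}

/* Protection interval length: logical block length / 2^pi_exponent. */
uint32_t protection_interval_len(uint32_t logical_block_len,
                                 unsigned int pi_exponent)
{
    if (pi_exponent >= 32)
        return 0;        /* shifting by 32 or more would be undefined */
    return logical_block_len >> pi_exponent;
}
```

With 512-byte blocks and an exponent of 0 the interval is the block itself, so eight blocks occupy 8 * 520 = 4160 bytes; a 4096-byte logical block with an exponent of 3 would also give 512-byte intervals.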



Re: AW: IPoIB GRO

2013-11-04 Thread Wendy Cheng
I looked at the TSO code earlier this year. IIRC, if TSO is on, the upper
layer (e.g. IP) just sends the super-packet down (to IPoIB) without
segmentation (for send); if it is off, the upper layer does the segmentation
(to match the MTU size) before calling the device's send. For GRO, I would
imagine it needs some sort of segmentation sequence to know how to
pull the segments together on the receive end. It looks to me as if
segmentation offload (TSO) and receive offload (GRO) are mutually
exclusive? Check out dev_gro_receive() (line numbers based on the 2.6.32
RHEL kernel):

   2980
   2981 if (skb_is_gso(skb) || skb_has_frags(skb))
   2982 goto normal;


See how it bails out when TSO (skb_is_gso()) is on? So it looks like
an IPoIB bug that ipoib_ib_handle_rx_wc() does an unconditional
napi_gro_receive() regardless of adapter capability (and TSO setting).

Just a guess !
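
The condition on line 2981 can be modelled outside the kernel as a simple predicate. This is a sketch only: struct pkt_model and its fields are hypothetical stand-ins for the relevant skb state, and it says nothing about whether packets received on ib0 actually satisfy the condition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pieces of skb state the check reads. */
struct pkt_model {
    unsigned int gso_size;   /* nonzero when the packet is marked as GSO */
    const void *frag_list;   /* non-NULL when the packet has chained fragments */
};

/* True when the packet would take the "goto normal" path and skip GRO. */
bool bypasses_gro(const struct pkt_model *p)
{
    return p->gso_size != 0 || p->frag_list != NULL;
}
```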

-- Wendy