[PATCH v1 1/1] mlx4: disable TSO for Connect-X rev a0 cards

2014-08-13 Thread Markus Stockhausen
commit 269a731254c4b8daa22ce09108e239dce7e6bb71
Author: Markus Stockhausen markus.stockhau...@collogia.de
Date:   Sun Aug 10 17:25:55 2014 +

mlx4: disable TSO for Connect-X rev a0 cards

According to http://marc.info/?t=13834764094&r=1&w=2 Connect-X
cards with revision a0 do not correctly assemble TSO packets. So
simply disable that feature.

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 0f7027e..c6fd057 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -59,6 +59,7 @@
 
 #define MLX4_IB_FLOW_MAX_PRIO 0xFFF
 #define MLX4_IB_FLOW_QPN_MASK 0xFF
+#define MLX4_IB_CARD_REV_A0   0xA0
 
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("Mellanox ConnectX HCA InfiniBand driver");
@@ -158,7 +159,9 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 		props->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE;
 	if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM)
 		props->device_cap_flags |= IB_DEVICE_UD_IP_CSUM;
-	if (dev->dev->caps.max_gso_sz && dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BLH)
+	if (dev->dev->caps.max_gso_sz &&
+	    (dev->dev->rev_id != MLX4_IB_CARD_REV_A0) &&
+	    (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BLH))
 		props->device_cap_flags |= IB_DEVICE_UD_TSO;
 	if (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_RESERVED_LKEY)
 		props->device_cap_flags |= IB_DEVICE_LOCAL_DMA_LKEY;

Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail
irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und
vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte
Weitergabe dieser Mail ist nicht gestattet.

Über das Internet versandte E-Mails können unter fremden Namen erstellt oder
manipuliert werden. Deshalb ist diese als E-Mail verschickte Nachricht keine
rechtsverbindliche Willenserklärung.

Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln

Vorstand:
Kadir Akin
Dr. Michael Höhnerbach

Vorsitzender des Aufsichtsrates:
Hans Kristian Langva

Registergericht: Amtsgericht Köln
Registernummer: HRB 52 497

This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.

e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.

Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln

executive board:
Kadir Akin
Dr. Michael Höhnerbach

President of the supervisory board:
Hans Kristian Langva

Registry office: district court Cologne
Register number: HRB 52 497




AW: AW: AW: AW: IPoIB GRO

2013-11-05 Thread Markus Stockhausen
>> so, to summarize:
>> The HW does the work (truncates the big ip packet to series of ip
>> packets, each with the relevant mtu size and increases the ip-id for
>> each)
>> The FW enables that work on the HW
>> the FW in A0 card doesn't enable that option for the HW.
>
> Sounds like this bug causes a performance regression, and it sounds
> like it puts incorrect packets on the wire.
>
> This should be patched, have the driver disable TSO for cards that
> can't support it...
>
> Jason

Incredible how a card that does not support TSO can put big packets
on the wire that somehow get reassembled on the client side :) Maybe
a two-liner in mlx4_ib_query_device() could prevent further discussions.

Markus





AW: IPoIB GRO

2013-11-05 Thread Markus Stockhausen
> From: Or Gerlitz [ogerl...@mellanox.com]
> Sent: Wednesday, 6 November 2013 08:50
> To: Markus Stockhausen; Jason Gunthorpe; Erez Shitrit
> Cc: linux-rdma@vger.kernel.org; Wendy Cheng
> Subject: Re: IPoIB GRO
>
> On 05/11/2013 20:08, Markus Stockhausen wrote:
>> Incredible how a card that does not support TSO can bring big packets
>> on the wire that somehow get reassembled on the client side
> not sure to follow, you have shown they are **not** reassembled, correct?

Sorry for being imprecise. I meant that activating TSO
on that particular card seems to do nothing more than
create fragments. They are reassembled, but not in the
GRO path. From my naive point of view that could have
caused many more problems than GRO not working
correctly.

Markus






AW: IPoIB GRO

2013-11-04 Thread Markus Stockhausen
> Thats why the flush flag is always set and the GRO stack does
> not work at all. I'm willing to dig deeper into this but I'm unsure
> if those fields are filled on sender or receiver side and especially
> where in the IPoIB stack.

Maybe I have found the reason for that strange ACK behaviour during
large NFS over IPoIB reads, and hopefully someone can confirm this.

If I turn on TSO for an IPoIB datagram interface on the sender side,
GRO on the receiver side is totally broken. This is due to the fact that
TSO generates large 60k packets that are offloaded into
fragments. Each of these fragments has the same ID in the packet
header. GRO expects IDs to be in incremental order and issues a
flush after each packet. Each flush results in an ACK packet back
to the server.

With TSO disabled GRO can kick in. Packets are built with
sequential IDs. With GRO working, the receiver only acknowledges
every few packets.
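
The mechanism described above can be sketched in user space. The following Python model is only an illustration, not the kernel code: it reduces GRO to the IP ID comparison, ignores all other header fields and any aggregation size limit, and the function name `count_gro_flushes` and both ID sequences are hypothetical.

```python
def count_gro_flushes(ip_ids):
    """Count how often a simplified GRO ID check would force a flush.

    Models only the test ((u16)(held_id + count) ^ new_id) != 0;
    every other header comparison is ignored.
    """
    flushes = 0
    held_id = None   # IP ID of the first segment in the held aggregate
    count = 0        # segments merged into the held aggregate so far
    for new_id in ip_ids:
        if held_id is None:
            held_id, count = new_id, 1
            continue
        if ((held_id + count) & 0xFFFF) != new_id:
            flushes += 1                 # mismatch: the aggregate is flushed
            held_id, count = new_id, 1   # start over with this segment
        else:
            count += 1                   # consecutive ID: segment is merged
    return flushes


# Hypothetical ID sequences for eight segments of one large send.
same_id = [35754] * 8                                     # one ID reused
incrementing = [(35754 + i) & 0xFFFF for i in range(8)]   # ID increases by 1

print("flushes with identical IDs:   ", count_gro_flushes(same_id))
print("flushes with incrementing IDs:", count_gro_flushes(incrementing))
```

In this model every segment after the first triggers a flush when the IDs are identical, while the incrementing sequence is merged without any flush.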

For a fully cached file read of 6GB the numbers read:

TSO on: ~220MByte/s - 1,522,679 MLX4 Interrupts on server
TSO off: ~550MByte/s - 318,322 MLX4 Interrupts on server
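
As a rough cross-check, the two measurements can be turned into bytes transferred per interrupt. This is a back-of-the-envelope sketch that assumes "6GB" means 6 GiB and that all counted MLX4 interrupts belong to this transfer; the helper name `bytes_per_interrupt` is made up for illustration.

```python
def bytes_per_interrupt(total_bytes, interrupts):
    """Average number of payload bytes moved per counted interrupt."""
    return total_bytes / interrupts


TOTAL_BYTES = 6 * 1024**3      # assumption: "6GB" read as 6 GiB

INTERRUPTS_TSO_ON = 1_522_679  # figures quoted in the message
INTERRUPTS_TSO_OFF = 318_322

tso_on = bytes_per_interrupt(TOTAL_BYTES, INTERRUPTS_TSO_ON)
tso_off = bytes_per_interrupt(TOTAL_BYTES, INTERRUPTS_TSO_OFF)

print(f"TSO on:  {tso_on:8.0f} bytes per interrupt")
print(f"TSO off: {tso_off:8.0f} bytes per interrupt")
print(f"interrupt ratio on/off: {INTERRUPTS_TSO_ON / INTERRUPTS_TSO_OFF:.1f}")
```

Under these assumptions the server handles roughly 4 KB per interrupt with TSO on and roughly 20 KB with TSO off.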

Is there any chance IPoIB TSO handling can be optimized?

Markus





AW: IPoIB GRO

2013-11-04 Thread Markus Stockhausen
Hi Erez,

> don't you see that behaviour in tcpdump? what kernel are you using?

On the server side we have a 3.5 kernel and on the client side a 3.11
kernel, each with the standard kernel drivers/modules. I can see the same
pattern of GRO aggregation on the client that you mention, but only if I
disable TSO for ib0 on the server side.

The test I'm running on the client looks like this. The second and third
read runs are definitely served from the NFS server-side cache.

sysctl -w net.ipv4.tcp_mem="4096 65536 4194304"
sysctl -w net.ipv4.tcp_rmem="4096 65536 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608

mount -o nfsvers=3,rsize=262144,wsize=262144 10.10.30.251:/export /mnt
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/xxx.iso of=/dev/null bs=1M count=5000
umount /mnt

Markus





AW: AW: IPoIB GRO

2013-11-04 Thread Markus Stockhausen
> Hi Markus,
>
> Can you please tell me what is the FW version you have on your ConnectX
> cards?

Of course. The server has:

root@client:~# ibstat
CA 'mlx4_0'
    CA type: MT26418
    Number of ports: 1
    Firmware version: 2.9.1000
    Hardware version: a0
    Node GUID: 0x0002c903000ec11a
    System image GUID: 0x0002c903000ec11d
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 20
        Base lid: 4
        LMC: 0
        SM lid: 2
        Capability mask: 0x02510868
        Port GUID: 0x0002c903000ec11b

The client has an older 2.7.x firmware, mostly because of
the X58 chipset incompatibility with newer firmware versions. Your
question suggests that this behaviour may be related to the
older firmware, so I moved the client-side test to another
host with newer firmware. Nevertheless the TSO problem
occurs there too.

root@client:~# ibstat
CA 'mlx4_0'
    CA type: MT25418
    Number of ports: 2
    Firmware version: 2.9.1000
    Hardware version: a0
    Node GUID: 0x001e0b4cf9c4
    System image GUID: 0x001e0b4cf9c7
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 20
        Base lid: 14
        LMC: 0
        SM lid: 2
        Capability mask: 0x02510868
        Port GUID: 0x001e0b4cf9c5
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x001e0b4cf9c6

Best regards & thanks in advance.

Markus





IPoIB GRO

2013-11-03 Thread Markus Stockhausen
Hello,

I have a small update on the unfortunate GRO behaviour over IPoIB that I
have observed over the last few weeks in datagram mode on our ConnectX
cards. In the GRO receive path the kernel steps into the
inet_gro_receive() function of net/ipv4/af_inet.c. If I read the code
right, it compares two IP packets and decides whether they come from the
same flow. Further checks are included in some subroutines that narrow
down the comparison to IPv4 and so on.

I put a debugging message after the following comparison, which
seems to be the culprit.

inet_gro_receive()
  ...
  /* All fields must match except length and checksum. */
  NAPI_GRO_CB(p)->flush |=
    (iph->ttl ^ iph2->ttl) |
    (iph->tos ^ iph2->tos) |
    (__force int)((iph->frag_off ^ iph2->frag_off) & htons(IP_DF)) |
    ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
  /* Do some debug */
  printk("%i %i %i\n", ntohs(iph2->id), NAPI_GRO_CB(p)->count, id);
  ...

On a normal GBit Intel card the kernel output reads:

32933 12 32945 
32933 13 32946
32946 1 32947
32946 2 32948
...
32946 15 32961
32964 3 32967
32964 4 32968
...

The interpretation of it all should be that packet ids must match 
the sum of the initial packet id plus its count field. Then
we have a GRO candidate.

On our ib0 interface the count field of a received packet seems
to be 1 most of the time and the packet id always matches the
initial packet id:

35754 1 35754
35754 1 35754
35754 1 35754
...
35754 1 35786
35786 1 35786
35786 1 35786
...

That's why the flush flag is always set and the GRO stack does
not work at all. I'm willing to dig deeper into this but I'm unsure
whether those fields are filled on the sender or receiver side, and
especially where in the IPoIB stack. Maybe someone can point me in the
right direction so that I can dig deeper and provide some more
information.
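
To make the reading of the debug output concrete, here is a small Python sketch that applies only the ID term of the flush expression to some of the triples printed above. It is a user-space illustration, not kernel code; the helper name `id_check_passes` is hypothetical, and the other terms of the expression (ttl, tos, DF bit) are assumed to match.

```python
def id_check_passes(held_id, count, new_id):
    """True if ((u16)(held_id + count) ^ new_id) is zero, i.e. no flush."""
    return (((held_id + count) & 0xFFFF) ^ new_id) == 0


# (ntohs(iph2->id), count, id) triples copied from the debug output.
intel_gbit = [
    (32933, 12, 32945), (32933, 13, 32946), (32946, 1, 32947),
    (32946, 2, 32948), (32946, 15, 32961), (32964, 3, 32967),
]
ib0 = [(35754, 1, 35754), (35754, 1, 35786), (35786, 1, 35786)]

for name, triples in (("Intel GBit", intel_gbit), ("ib0", ib0)):
    passed = sum(id_check_passes(*t) for t in triples)
    print(f"{name}: {passed} of {len(triples)} triples pass the ID check")
```

All of the Intel triples satisfy held ID + count == new ID, while none of the ib0 triples do, which matches the observation that the flush flag is always set on ib0.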

Best regards.

Markus






AW: ACK behaviour difference LRO/GRO

2013-10-29 Thread Markus Stockhausen
 
> From: Or Gerlitz [ogerl...@mellanox.com]
> Sent: Tuesday, 29 October 2013 09:31
> To: Markus Stockhausen; linux-rdma@vger.kernel.org; Yishai Hadas
> Cc: s.wendy.ch...@gmail.com; Erez Shitrit; Saeed Mahameed
> Subject: Re: ACK behaviour difference LRO/GRO
>
> On 28/10/2013 21:34, Markus Stockhausen wrote:
>> After some quite hard test iterations the problem seems to come from the
>> IPoIB switch from LRO to GRO between kernels 2.6.37 and 2.6.38.
>>
>> I built a test setup with a 2.6.38 kernel and additionally compiled a 2.6.37
>> ib_ipoib module against it. This way I can run a direct comparison
>> between the old and new module. The major difference between the
>> two versions is inside the ipoib_ib_handle_rx_wc() function:
>>
>> 2.6.37: lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
>> 2.6.38: napi_gro_receive(&priv->napi, skb);
>
> These two commits that went in 3.3
>
> 936d7de IPoIB: Stop lying about hard_header_len and use skb->cb to stash
> LL addresses
> a0417fa net: Make qdisc_skb_cb upper size bound explicit
>
> were supposed to make IPoIB/GRO to work properly, specifically with
> them, you should see aggregation coming into play
>
> I think Yishai Hadas from Mellanox was looking on that too, do we have
> any insights on the matter?
>
> Or.

At least for 2.6.38 that sounds clear. My initial post was about
3.5 and 3.10 test kernels that showed the missing aggregation. So
I'm still some way from a solution. I will try to get the test machine
back to 3.10/3.11 to validate it once again.

Just to be sure I am on the right track: what are the basics to get GRO
working with a ConnectX (not 2 or 3) card in 2044 MTU datagram mode?

- Enable GRO with ethtool.
- Activate coalescing with ethtool? If yes, how?

Thanks for the help.

Markus





AW: AW: ACK behaviour difference LRO/GRO

2013-10-29 Thread Markus Stockhausen
> From: Or Gerlitz [ogerl...@mellanox.com]
> Sent: Tuesday, 29 October 2013 14:58
> To: Erez Shitrit
> Cc: Markus Stockhausen; linux-rdma@vger.kernel.org; Yishai Hadas;
> s.wendy.ch...@gmail.com; Erez Shitrit; Saeed Mahameed
> Subject: Re: AW: ACK behaviour difference LRO/GRO
>
> On 29/10/2013 14:55, Erez Shitrit wrote:
>> In addition to what Or just wrote,
>> GRO currently doesn't work on ipoib interfaces, that according to bad
>> handling mac address that are not 6 bytes (we have plans to fix that
>> in the near future), that is the reason you don't see 64k packets on
>> tcpdump (like you see in LRO).
>
> I just checked with net-next which is 3.12-rc6+ and there IS
> aggregation for datagram mode
>
> 15:56:40.983883 IP 192.168.20.18.55714 > 192.168.20.17.40861: Flags
> [.], seq 1801688305:1801692289, ack 1, win 220, options [nop,nop,TS
> val 44014459 ecr 305403520], length 3984
> 15:56:40.983942 IP 192.168.20.18.55714 > 192.168.20.17.40861: Flags
> [.], seq 1801692289:1801756033, ack 1, win 220, options [nop,nop,TS
> val 44014459 ecr 305403520], length 63744
> 15:56:40.984027 IP 192.168.20.18.55714 > 192.168.20.17.40861: Flags
> [.], seq 1801756033:1801819777, ack 1, win 220, options [nop,nop,TS
> val 44014459 ecr 305403520], length 63744
> 15:56:40.984079 IP 192.168.20.17.40861 > 192.168.20.18.55714: Flags
> [.], ack 1801688305, win 1544, options [nop,nop,TS val 305403520 ecr
> 44014459], length 0
> 15:56:40.984104 IP 192.168.20.18.55714 > 192.168.20.17.40861: Flags
> [.], seq 1801819777:1801823649, ack 1, win 220, options [nop,nop,TS
> val 44014459 ecr 305403520], length 3872
> 15:56:40.984159 IP 192.168.20.18.55714 > 192.168.20.17.40861: Flags
> [.], seq 1801823649:1801883521, ack 1, win 220, options [nop,nop,TS
> val 44014459 ecr 305403520], length 59872
> 15:56:40.984214 IP 192.168.20.17.40861 > 192.168.20.18.55714: Flags
> [.], ack 1801819777, win 1009, options [nop,nop,TS val 305403520 ecr
> 44014459], length 0
> 15:56:40.984241 IP 192.168.20.18.55714 > 192.168.20.17.40861: Flags
> [.], seq 1801883521:1801887393, ack 1, win 220, options [nop,nop,TS
> val 44014459 ecr 305403520], length 3872
 

Thanks to both of you for that clarification. Nevertheless this
info seems a little contradictory. Should I expect GRO to
work on Mellanox IB cards with Linux 3.12 in general? Or is
this only an effect of your test cards having good
MAC addresses?

Markus





Re: AW: AW: ACK behaviour difference LRO/GRO

2013-10-29 Thread Markus Stockhausen
> From: Or Gerlitz [ogerl...@mellanox.com]
> Sent: Tuesday, 29 October 2013 16:55
> To: Markus Stockhausen; Erez Shitrit
> Cc: linux-rdma@vger.kernel.org; Yishai Hadas; s.wendy.ch...@gmail.com; Saeed
> Mahameed
> Subject: Re: AW: AW: ACK behaviour difference LRO/GRO
>
> On 29/10/2013 17:54, Markus Stockhausen wrote:
>> Should I expect GRO to work on Mellanox IB cards with Linux 3.12 in
>> general?
>
> YES

I'm sorry to tell you that a jump onto the 3.12-rc6 bandwagon did not
give a better picture. To sum up the most recent situation:

- NFS server based on a 3.5 kernel
- Server has an older Xeon L5420
- NFS client based on a 3.12-rc6 kernel; NFS version 3 or 4 does not matter
- Client has a newer Core i7
- Both ends use datagram mode with 2K MTU
- Client reads data from NFS (RAM) with dd into the null device
- tcpdump is running on the server side.

From the dump I can see, most of the time, about 20 incoming ACK
packets for each outgoing 64k data packet. Statistics from
/proc/interrupts on the server side give a result that fits the picture:
~87 interrupts for 4GB of transferred data. That is roughly one
interrupt per 4.5K. Not very scientific, but at least an idea of what is
going on.

With Erez's MAC address explanation in mind, I would say that
there is still room to improve GRO even in the newest kernels.

Markus 







ACK behaviour difference LRO/GRO

2013-10-28 Thread Markus Stockhausen
Hello,

About two months ago we had some problems with IPoIB transfer speeds.
See http://marc.info/?l=linux-rdma&m=137823326109158&w=2 for details.
After some quite hard test iterations the problem seems to come from the
IPoIB switch from LRO to GRO between kernels 2.6.37 and 2.6.38.

I built a test setup with a 2.6.38 kernel and additionally compiled a 2.6.37
ib_ipoib module against it. This way I can run a direct comparison
between the old and new module. The major difference between the
two versions is inside the ipoib_ib_handle_rx_wc() function:

2.6.37: lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
2.6.38: napi_gro_receive(&priv->napi, skb);

As in the last post we use ConnectX cards in datagram mode with a
2044 MTU. We read a file sequentially from an NFS server into /dev/null.
We just want to measure the wire speed and leave hard drives out of the
picture. The hardware is slightly newer so we get different transfer
speeds but the overall effect should be evident. The server uses a 3.5
kernel and is not changed during the tests.

With 2.6.37 IPoIB module on the client side and LRO enabled the 
speed is 950 MByte/sec. On the NFS server side a tcpdump trace 
reads like:

19:51:51.432630 IP 10.10.30.251.nfs > 10.10.30.1.781:
  Flags [P.], seq 1008434065:1008497161, ack 617432,
  win 688, options [nop,nop,TS val 133047292 ecr 429568],
  length 63096
19:51:51.432672 IP 10.10.30.1.781 > 10.10.30.251.nfs:
  Flags [.], ack 1008241041, win 24576, options
  [nop,nop,TS val 429568 ecr 133047292], length 0
19:51:51.432677 IP 10.10.30.251.nfs > 10.10.30.1.781:
  Flags [.], seq 1008497161:1008560905, ack 617432,
  win 688, options [nop,nop,TS val 133047292 ecr 429568],
  length 63744
19:51:51.432725 IP 10.10.30.1.781 > 10.10.30.251.nfs:
  Flags [.], ack 1008304585, win 24576, options
  [nop,nop,TS val 429568 ecr 133047292], length 0
19:51:51.432729 IP 10.10.30.251.nfs > 10.10.30.1.781:
  Flags [.], seq 1008560905:1008624649, ack 617432,
  win 688, options [nop,nop,TS val 133047292 ecr 429568],
  length 63744

With some slight differences here and there the client sends only
1 ACK for about 60k of transferred data. With the 2.6.38 module and
onwards (GRO enabled) the speed drops to 380 MByte/sec
and the transfer pattern changes:

19:58:14.631430 IP 10.10.30.251.nfs > 10.10.30.1.ircs:
  Flags [.], seq 722492293:722502253, ack 442312, win 537,
  options [nop,nop,TS val 133143092 ecr 467889], length 9960
19:58:14.631460 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
  Flags [.], ack 722478181, win 24562, options
  [nop,nop,TS val 467889 ecr 133143092], length 0
19:58:14.631485 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
  Flags [.], ack 722478181, win 24562, options
  [nop,nop,TS val 467889 ecr 133143092,nop,nop,sack 1
  {722480117:722482333}], length 0
19:58:14.631510 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
  Flags [.], ack 722488197, win 24562, options [nop,nop,TS
  val 467889 ecr 133143092], length 0
19:58:14.631534 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
  Flags [.], ack 722494229, win 24562, options
  [nop,nop,TS val 467889 ecr 133143092], length 0

It seems as if the NFS client acknowledges every 2K packet
separately. I thought this might come from missing
coalescing parameters and tried ethtool -C ib0 rx-usecs 5
on both machines, but without success.
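
The ACK density of the two traces can be compared numerically. The following is a minimal user-space sketch, not part of the original mail: it takes the cumulative ACK numbers copied from the captures above and computes how many bytes each ACK covers on average. The function name avg_bytes_per_ack is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average number of bytes covered by each ACK, computed from the
 * cumulative ACK numbers seen in a capture (oldest first).
 * Storing the difference in uint32_t keeps the result correct across
 * TCP sequence-number wraparound. */
double avg_bytes_per_ack(const uint32_t *acks, size_t n)
{
    assert(acks != NULL && n >= 2);
    uint32_t advanced = acks[n - 1] - acks[0];
    return (double)advanced / (double)(n - 1);
}
```

Fed with the two ACKs of the first trace this gives 63544 bytes per ACK; the four ACKs of the second trace give roughly 5349 bytes per ACK, and one of them is a duplicate carrying a SACK block.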

I'm quite lost now. Maybe someone can give me a tip if I'm
missing something.

Best regards.

Markus
This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.

e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.

Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln

executive board:
Kadir Akin
Dr. Michael Höhnerbach

President of the supervisory board:
Hans Kristian Langva

Registry office: district court Cologne
Register number: HRB 52 497


Re: Strange NFS client ACK behaviour

2013-09-08 Thread Markus Stockhausen
 From: Wendy Cheng [s.wendy.ch...@gmail.com]
 Sent: Thursday, 5 September 2013 18:42
 To: Markus Stockhausen
 Cc: linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
 Subject: Re: Strange NFS client ACK behaviour
 
 CC linux-nfs .. maybe this is obvious to someone there ... Two
 comments inlined below.
 
 On Tue, Sep 3, 2013 at 11:28 AM, Markus Stockhausen
 stockhau...@collogia.de wrote:
  Hello,
 
  we observed a performance drop in our IPoIB NFS backup
  infrastructure since we switched to machines with newer
  kernels. As I do not know where to start I hope someone
  on this list can give me hint where to dig for more details.
 
 In case of no other reply, I would start w/ a socket program (or a
 network performance measuring tool) on the interface that does similar
 logic as dd you described below; that is, send a 256K message in a
 fixed number of loops (so total transfer size somewhere close to your
 file size) between client and server, followed by comparing the
 interrupt counters (cat /proc/interrupts) on both kernels. If the
 interrupt count differs as you described, the problem is most likely
 with the IB driver, not NFS layer.
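
 The counter comparison suggested above can be scripted. The sketch below is an illustration only and not from the thread: it sums the per-CPU counters of a single /proc/interrupts line, so that two snapshots taken before and after a transfer can be subtracted. The function name sum_irq_line and the device name in the sample line are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters of one /proc/interrupts line, e.g.
 * " 42:   100   250   PCI-MSI-edge   mlx4-comp-0" (sample name is made up).
 * Returns 0 if the line has no "NN:" prefix. */
unsigned long long sum_irq_line(const char *line)
{
    assert(line != NULL);
    const char *p = strchr(line, ':');
    unsigned long long total = 0;

    if (p == NULL)
        return 0;
    p++;
    for (;;) {
        char *end;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;
        unsigned long long v = strtoull(p, &end, 10);
        /* A token such as "524288-edge" starts with digits but is
         * part of the description, not a counter. */
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;
        total += v;
        p = end;
    }
    return total;
}
```

 Calling it on the same interrupt line before and after the dd run and subtracting the two sums gives the number of interrupts caused by the transfer.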
 
 
  To make a long story short. We use ConnectX cards with the
  standard kernel drivers on version 2.6.32 (Ubuntu 10.04), 3.5
  (Ubuntu 12.04) and 3.10 (Fedora 19). The very simple and not
  scientific test consists of mounting a NFS share using IPoIB UD
  network interfaces at MTU of 2044. Afterwards read a large file
  on the client side with dd if=file of=/dev/null bs=256K.
  During the transfer we run a tcpdump on the ibX interface on
  the NFS server side. No special settings for kernel parameters
  until now.
 
 I don't know much about ConnectX. Not sure what IPoIB UD means ?
 Datagram vs. CM or TCP vs. UDP ?
 

Hello,

thanks for stopping by. I followed your advice and compared the
behaviour to a tcpdump of a netserver/netperf session. That reads
more normal:

20:06:50.397472 IP server.43845 > client.58489: seq ... length 63744
20:06:50.397531 IP client.58489 > server.43845: ack ... length 0
20:06:50.397567 IP server.43845 > client.58489: seq ... length 63744
20:06:50.397595 IP server.43845 > client.58489: seq ... length 63744
20:06:50.397622 IP server.43845 > client.58489: seq ... length 63744
20:06:50.397632 IP client.58489 > server.43845: ack ... length 0
20:06:50.397667 IP server.43845 > client.58489: seq ... length 63744
20:06:50.397715 IP client.58489 > server.43845: ack ... length 0
20:06:50.397723 IP client.58489 > server.43845: ack ... length 0

It is not really comparable as we see transfers to several ports,
but at least there is a much better ratio between data and ACK
packets. It seems that further digging into NFS behaviour is
required. But that may be another story.

Markus







Strange NFS client ACK behaviour

2013-09-03 Thread Markus Stockhausen
Hello,

we observed a performance drop in our IPoIB NFS backup
infrastructure since we switched to machines with newer 
kernels. As I do not know where to start I hope someone 
on this list can give me a hint where to dig for more details.

To make a long story short: we use ConnectX cards with the
standard kernel drivers on version 2.6.32 (Ubuntu 10.04), 3.5
(Ubuntu 12.04) and 3.10 (Fedora 19). The very simple and not
scientific test consists of mounting an NFS share using IPoIB UD
network interfaces at an MTU of 2044. Afterwards we read a large
file on the client side with dd if=file of=/dev/null bs=256K.
During the transfer we run a tcpdump on the ibX interface on
the NFS server side. No special settings for kernel parameters
until now.

When doing the test with a 2.6.32 kernel based client we see the
following packet sequence: many blocks transferred from the NFS
server to the client, with only an occasional ACK packet from the
client to the server:

16:16:45.050930 IP server.nfs > cli_2_6_32.896:
  Flags [.], seq 8909853:8913837, ack 1154149,
  win 604, options [nop,nop,TS val 1640401415
  ecr 3881919089], length 3984
16:16:45.050936 IP server.nfs > cli_2_6_32.896:
  Flags [.], seq 8913837:8917821, ack 1154149,
  win 604, options [nop,nop,TS val 1640401415
  ecr 3881919089], length 3984

... 8 more ...

16:16:45.050976 IP cli_2_6_32.896 > server.nfs:
  Flags [.], ack 8909853, win 24574, options
  [nop,nop,TS val 3881919089 ecr 1640401415],
  length 0
...

After switching to a client with a newer kernel (3.5 or 3.10) the
sequence all of a sudden shows just the opposite behaviour.
Note that this is the same server as in the test above. The
server sends bigger packets (I guess TSO is doing the rest of
the work). After each packet the client sends several ACK
packets back.

16:15:21.038782 IP server.nfs > cli_3_5_0.928:
  Flags [.], seq 9612429:9652269, ack 372776,
  win 5815, options [nop,nop,TS val 1640380412
  ecr 560111379], length 39840
16:15:21.038806 IP cli_3_5_0.928 > server.nfs:
  Flags [.], ack 9542205, win 16384, options
  [nop,nop,TS val 560111379 ecr 1640380412],
  length 0
16:15:21.038812 IP cli_3_5_0.928 > server.nfs:
  Flags [.], ack 9546077, win 16384, options
  [nop,nop,TS val 560111379 ecr 1640380412],
  length 0

... 6-8 more ...

The visible side effects of this changed processing include:
- NIC interrupts on the NFS server rise by a factor of 8.
- Transfer speed drops by 50% (from 400 to 200 MB/sec).
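
The captured lengths fit the TSO guess: with an MTU of 2044 and the TCP timestamp option visible in the traces, one on-wire segment carries 1992 bytes of payload, and both 3984 and 39840 are exact multiples of that. A small sketch of the arithmetic follows; it is an illustration added here, it assumes IPv4 and TCP headers without further options, and the function names are hypothetical.

```c
#include <assert.h>

enum {
    IPV4_HDR_LEN   = 20, /* IPv4 header without options */
    TCP_HDR_LEN    = 20, /* TCP header without options */
    TCP_TS_OPT_LEN = 12  /* nop,nop,TS val/ecr as seen in the traces */
};

/* Payload bytes carried by one on-wire segment for a given IP MTU. */
unsigned int tcp_payload_per_segment(unsigned int mtu)
{
    assert(mtu > IPV4_HDR_LEN + TCP_HDR_LEN + TCP_TS_OPT_LEN);
    return mtu - IPV4_HDR_LEN - TCP_HDR_LEN - TCP_TS_OPT_LEN;
}

/* Number of full segments in a captured length, or 0 if the length is
 * not an exact multiple of the per-segment payload. */
unsigned int wire_segments(unsigned int mtu, unsigned int captured_len)
{
    unsigned int mss = tcp_payload_per_segment(mtu);

    return (captured_len % mss == 0) ? captured_len / mss : 0;
}
```

For MTU 2044 this yields 2 segments for the 3984 byte captures of the 2.6.32 run and 20 segments for the 39840 byte capture of the 3.5 run.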

Best regards.

Markus




IPoIB header prefetch for fragmented SKB

2013-04-17 Thread Markus Stockhausen



That's probably because of a cache line miss.

The thing I don't really understand is that normally, the first cache
line (64 bytes) contains both the Ethernet header and IPv4 header.

So what does this adapter do in this respect?

I guess you should try to use IPOIB_UD_HEAD_SIZE=64 to use the whole
cache line.

Many drivers use prefetch() to make sure cpu starts to bring this
cache line into cache as soon as possible.

A single prefetch() call at the right place might help a lot.

Hello,

@Eric: Thanks for the tip.

In the 4K MAX MTU IPoIB driver path ipoib_ib_handle_rx_wc() will
produce an empty skb linear part with the whole data placed into
the first fragment. napi_gro_receive() finally pulls the IP
header out of the fragment into the linear part.

As far as I understand the pull out of the fragment should come
without additional cost when one calls a prefetch long before
the skb_pull(). 

I'm willing to check this out but I'm unsure if the IP header
is aligned to a cache line of 64 bytes. As a simple guess I
would implement the prefetch here:

static void ipoib_ib_handle_rx_wc(...)
  ...
  skb_pull(skb, IB_GRH_BYTES);

  skb->protocol = ((struct ipoib_header *) skb->data)->proto;
  skb_reset_mac_header(skb);
  skb_pull(skb, IPOIB_ENCAP_LEN);
+
+ if (ipoib_ud_need_sg(priv->max_ib_mtu))
+   prefetch(whatever address);
  ...

Can you give me a hint what address one should put into the call?
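
Whether one prefetch is enough depends on how the header sits relative to the 64 byte cache lines. The following user-space sketch is added for illustration only and does not answer the question above: it counts the cache lines a header touches and uses the GCC/Clang builtin __builtin_prefetch as a stand-in for the kernel's prefetch(). Since the mail states that the whole data lands in the first fragment, the start of that fragment's data would be one candidate address; that is an assumption, not something confirmed in this thread. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumption: 64-byte cache lines, as in the mail */

/* Number of cache lines touched by a header of `len` bytes at `addr`. */
size_t cache_lines_spanned(uintptr_t addr, size_t len)
{
    assert(len > 0);
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last  = (addr + len - 1) / CACHE_LINE;

    return (size_t)(last - first + 1);
}

/* Hint the CPU to load the line holding `hdr`; compiler builtin used
 * as a user-space stand-in for the kernel's prefetch(). */
void prefetch_header(const void *hdr)
{
    __builtin_prefetch(hdr);
}
```

A 40 byte IP plus TCP header starting at a cache-line boundary touches one line; the same header placed 44 bytes into a line (behind GRH and IPoIB encapsulation) touches two.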

Thanks in advance.

Markus


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-16 Thread Markus Stockhausen



I only wonder what the effect on performance would be with an IB MTU
of 4K active; then full-sized packets would be pretty much exactly
split between the linear part and the fragment page.  How does GRO
cope with that?  I guess in the 2K IB MTU case there's no cost in
having all the data in the linear part of the skb.

 - R.


Hm,

if I think about the current situation we can only make it better.
When receiving a packet on a 4K HCA we have to pull the IP header
into the linear part of the SKB during GRO handling. That consumes
extra CPU cycles and does not depend on the packet size.

We can avoid this by splitting the packet at a well defined position.
Your patch made a cut at x+128 bytes. From my understanding the
position should have no performance impact. The 3 cases that we
analyzed up to now are:

- 2K fragment + header pull = fast
- header and some data in linear part + 1.9K fragment = faster
- only linear part + no fragment = fastest

Maybe I'm too hasty (without a 4K MTU test environment) but from
the above I would derive that larger packets will still benefit
from the adapted handling.

Markus




Re: [RFC/PATCH v4] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-16 Thread Markus Stockhausen

I am afraid I don't understand what the issue is.

The pull_tail() in itself is not a performance issue: Intel guys only
fixed a few days ago the fact that IGB/IXGBE drivers were not pulling
TCP headers into skb->head, and nobody noticed.

The real cost is the cache line miss.

Now, if you pull too many bytes into skb->head, say part of the TCP
payload, you lose opportunities in TCP coalescing or splice().

With patch v4 netperf and NFS receive performance rises to the
expected values. As I'm no expert in this I can only repost the
initial performance report that started the whole discussion.
__pskb_pull_tail consumes a lot of time on our Xeon L5420 test
server.


...

- server side: netserver -p 12345
- client side: netperf -H server_ip -p 12345 -l 120

Analysis was performed on the server side with
- perf record -a -g sleep 10
- perf report

# Overhead Symbol
#   .
#
19.67%  [k] copy_user_generic_string
|
|--99.74%-- skb_copy_datagram_iovec
|  tcp_recvmsg
|  inet_recvmsg
|  sock_recvmsg
|  sys_recvfrom
|  system_call_fastpath
|  recv
|  |
|  |--50.17%-- 0x7074656e00667265
|  |
|   --49.83%-- 0x6672657074656e
 --0.26%-- [...]
 7.38%  [k] memcpy
|
|--84.56%-- __pskb_pull_tail
|  |
|  |--81.88%-- pskb_may_pull.part.6
|  |  skb_gro_header_slow
|  |  inet_gro_receive
|  |  dev_gro_receive
|  |  napi_gro_receive
|  |  ipoib_ib_handle_rx_wc
|  |  ipoib_poll
|  |  net_rx_action
|  |  __do_softirq


Markus




Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-09 Thread Markus Stockhausen

 
-IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
+/* add 128 bytes of tailroom for IP/TCP headers */
+IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,

Hello,

the version 3 of the patch finally works. I can see the performance
gains but I cannot feel them (in real life). Here are the results
of my testbed:

Test 1:
netperf/netserver message size 16K

kernel 3.5 default :  5.1 GBit/s
kernel 3.5 + patch v3  :  7.7 GBit/s
kernel 3.5 + max MTU 3K: 10.8 GBit/s

Test 2:
Disk write performance
VM with disk mounted on IB async NFS server

block size | default  | patch v3 | max MTU 3K
-----------+----------+----------+-----------
      1 KB |  10 MB/s |  10 MB/s |  10 MB/s
      2 KB |  20 MB/s |  21 MB/s |  20 MB/s
      4 KB |  40 MB/s |  40 MB/s |  43 MB/s
      8 KB |  68 MB/s |  70 MB/s |  78 MB/s
     16 KB | 105 MB/s | 105 MB/s | 120 MB/s
     32 KB | 150 MB/s | 150 MB/s | 170 MB/s
     64 KB | 200 MB/s | 210 MB/s | 260 MB/s
    128 KB | 270 MB/s | 290 MB/s | 400 MB/s
    256 KB | 300 MB/s | 310 MB/s | 430 MB/s
    512 KB | 305 MB/s | 320 MB/s | 470 MB/s
   1024 KB | 310 MB/s | 325 MB/s | 500 MB/s
   2048 KB | 310 MB/s | 325 MB/s | 510 MB/s
   4096 KB | 370 MB/s | 325 MB/s | 510 MB/s
   8192 KB | 400 MB/s | 325 MB/s | 520 MB/s


As you can see netperf throughput increases while NFS does not
even care about the optimizations. Maybe it does not work well
with fragmented SKBs. The MAX MTU 3K values once again are
forced through a hack inside ipoib_main.c.

Out of curiosity I changed the block splitting in your v3 patch
from small head with large fragment to large head with small
fragment in this line.

IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 3072

In my 2044 MTU case this brings the netperf and NFS throughput to
the same levels as the dirty hack. Of course this no longer
reflects a head but is more or less equivalent to something like
a new constant IPOIB_UD_FIXED_SKB_SIZE.

I guess 4K MTU will not see any further gains but avoiding the
skb_pull calls should improve speed as well. Maybe a final
adaptation could put the cherry on the cake.

Markus




Re: [RFC/PATCH v2] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-08 Thread Markus Stockhausen


On 08.04.13 18:50, Roland Dreier (rol...@kernel.org) wrote:

On Mon, Apr 8, 2013 at 9:44 AM, Eric Dumazet eduma...@google.com wrote:
 An empty page in the frag list ? You mean a frag with a zero length ?

Yep... the code would do

skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE);

but then later on it did the equivalent of

skb_frag_size_set(frag, 0);
skb->truesize += PAGE_SIZE;

and somehow when data is flowing both directions with small packets,
that would lead the RX-side handling to return corrupt data to
userspace.

Not sure if it's worth figuring this out entirely ... the more
efficient way for ipoib to do things is to reuse that page for next
time, not pass it up the stack unused.

 - R.

First I thought perfect memory allocation inside ipoib_alloc_rx_skb()
could be realized by passing the received packet size from function
ipoib_ib_handle_rx_wc(). But I'm unable to estimate the requirements
of the similar call inside ipoib_ib_post_receives().

Nevertheless thank you very much for spending time on this.

Markus




Re: [RFC/PATCH v2] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-05 Thread Markus Stockhausen

From: Roland Dreier rol...@purestorage.com

Markus Stockhausen markus.stockhau...@gmx.de noticed that IPoIB was
spending significant time doing memcpy() in __pskb_pull_tail().  He
found that this is because his adapter reports a maximum MTU of 4K,
which causes IPoIB datagram mode to receive all the actual data in a
separate page in the fragment list.

We're already allocating extra tailroom for the skb linear part, so we
might as well use it.

Cc: Eric Dumazet eduma...@google.com
Reported-by: Markus Stockhausen markus.stockhau...@gmx.de
Signed-off-by: Roland Dreier rol...@purestorage.com
---
v2: Try to handle the case where we get all the data in the linear part
of the skb and don't need the frag part at all.

 drivers/infiniband/ulp/ipoib/ipoib.h|  3 ++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 17 ++---
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index eb71aaa..ab2cc4c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -64,7 +64,8 @@ enum ipoib_flush_level {
 enum {
 IPOIB_ENCAP_LEN  = 4,
 
-IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
+/* add 128 bytes of tailroom for IP/TCP headers */
+IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,
 IPOIB_UD_RX_SG  = 2, /* max buffer needed for 4K mtu */
 
 IPOIB_CM_MTU  = 0x1 - 0x10, /* padding to align header
to 16 */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 2cfa76f..ecf4faf 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -109,11 +109,12 @@ static void ipoib_ud_skb_put_frags(struct
ipoib_dev_priv *priv,
struct sk_buff *skb,
unsigned int length)
 {
-if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
+if (ipoib_ud_need_sg(priv->max_ib_mtu) &&
+length > IPOIB_UD_HEAD_SIZE) {
 skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
 unsigned int size;
 /*
- * There is only two buffers needed for max_payload = 4K,
+ * There are only two buffers needed for max_payload = 4K,
  * first buf size is IPOIB_UD_HEAD_SIZE
  */
 skb-tail += IPOIB_UD_HEAD_SIZE;
@@ -156,18 +157,12 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct
net_device *dev, int id)
 struct ipoib_dev_priv *priv = netdev_priv(dev);
 struct sk_buff *skb;
 int buf_size;
-int tailroom;
 u64 *mapping;
 
-if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-buf_size = IPOIB_UD_HEAD_SIZE;
-tailroom = 128; /* reserve some tailroom for IP/TCP headers */
-} else {
-buf_size = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
-tailroom = 0;
-}
+buf_size = ipoib_ud_need_sg(priv->max_ib_mtu) ?
+IPOIB_UD_HEAD_SIZE : IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
 
-skb = dev_alloc_skb(buf_size + tailroom + 4);
+skb = dev_alloc_skb(buf_size + 4);
 if (unlikely(!skb))
 return NULL;
 
-- 
1.8.1.2


Hello,

this patch is better than the first one. No more lockups of the server.
Nevertheless I'm sorry to tell you that we are not finished yet. After
running some promising netperf/netserver tests I switched over to NFS.
The machine behaves as if the handbrake were on. Throughput varies and
sometimes communication stops completely. dmesg shows some very ugly
output.

[39604.701231] RPC: multiple fragments per record not supported
[39604.706444] RPC: multiple fragments per record not supported
[39604.720131] RPC: multiple fragments per record not supported
[39604.720172] rpc-srv/tcp: nfsd: got error -32 when sending 140
   bytes - shutting down socket
[39604.750376] RPC: multiple fragments per record not supported
[39604.767136] RPC: multiple fragments per record not supported
[39604.803122] RPC: multiple fragments per record not supported
[39604.836599] RPC: multiple fragments per record not supported
[39604.846579] RPC: multiple fragments per record not supported
[39604.854730] RPC: multiple fragments per record not supported
[39604.862385] RPC: multiple fragments per record not supported
[39604.911370] rpc-srv/tcp: nfsd: got error -32 when sending 140
   bytes - shutting down socket
[39607.278661] rpc-srv/tcp: nfsd: got error -32 when sending 140
   bytes - shutting down socket
[39629.669290] receive_cb_reply: Got unrecognized reply: calldir
   0x1 xpt_bc_xprt   (null) xid 50916007
[39629.669303] net_ratelimit: 98 callbacks suppressed
[39629.669306] RPC: multiple fragments per record not supported


Maybe you know what problem is left.

Thank you very much.

Markus.

P.S. I did some more comprehensive tests on my Intel L5420 NFS
DDR Infiniband Server with my initial netperf/netserver scenario
and the dirty trick forcing the device max MTU to 3K

Re: [PATCH/RFC] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-04 Thread Markus Stockhausen

From: Roland Dreier rol...@purestorage.com

Markus Stockhausen markus.stockhau...@gmx.de noticed that IPoIB was
spending significant time doing memcpy() in __pskb_pull_tail().  He
found that this is because his adapter reports a maximum MTU of 4K,
which causes IPoIB datagram mode to receive all the actual data in a
separate page in the fragment list.

We're already allocating extra tailroom for the skb linear part, so we
might as well use it.

Cc: Eric Dumazet eduma...@google.com
Reported-by: Markus Stockhausen markus.stockhau...@gmx.de
Signed-off-by: Roland Dreier rol...@purestorage.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h|  3 ++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 12 +++-
 2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index eb71aaa..ab2cc4c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -64,7 +64,8 @@ enum ipoib_flush_level {
 enum {
 IPOIB_ENCAP_LEN  = 4,
 
-IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
+/* add 128 bytes of tailroom for IP/TCP headers */
+IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,
...

Thanks for the help but I guess the patch is not yet perfect.

My (remote) test machine stopped responding after loading the new
ipoib module. Tomorrow I can check the console. Having a look at the
source code I guess we now have some major problems when receiving
small packets:

...
ipoib_ud_skb_put_frags(..., unsigned int length)
  ...
  size = length - IPOIB_UD_HEAD_SIZE; /* may be less than zero! */
  skb_frag_size_set(frag, size);
  ...
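
The wraparound can be shown in user space. The sketch below is an illustration only: frag_size_unchecked mirrors the expression from the snippet above, while frag_size_guarded is a hypothetical guard and not the fix that was eventually applied. IB_GRH_BYTES is taken as 40 and the head size follows the patched definition.

```c
#include <assert.h>

enum {
    IB_GRH_BYTES       = 40,
    IPOIB_ENCAP_LEN    = 4,
    /* head size as defined by the patch under discussion */
    IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128
};

/* Mirrors the expression from the mail. Because `length` is unsigned,
 * a packet shorter than the head does not give a negative size but
 * wraps around to a huge value. */
unsigned int frag_size_unchecked(unsigned int length)
{
    return length - IPOIB_UD_HEAD_SIZE;
}

/* Hypothetical guard: a packet that fits into the head leaves nothing
 * for the fragment. */
unsigned int frag_size_guarded(unsigned int length)
{
    return (length > IPOIB_UD_HEAD_SIZE) ? length - IPOIB_UD_HEAD_SIZE : 0;
}
```

For a 100 byte packet the unchecked variant returns a value far above 100, while the guarded one returns 0; for a 2032 byte packet both return 1860.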

Markus




Re: IPoIB - GRO forces memcpy inside __pskb_pull_tail

2013-04-03 Thread Markus Stockhausen
Hello,


...
 If I get it right, roughly 6% (7.38% * 84.56%) of the time the machine
 does a memcpy inside __pskb_pull_tail. The comment on this function
 reads "... it expands header moving its tail forward and copying
 necessary data from fragmented part. ... It is pretty complicated.
 Luckily, it is called only in exceptional cases ..."
 That does not sound good at all. I repeated the test on a normal Intel
 gigabit network without jumbo frames and __pskb_pull_tail was not in
 the top consumer list.

 Does anyone have an idea if this is normal GRO behaviour for IPoIB? At
 the moment I have a full test environment and could implement and
 verify some kernel corrections if someone could give a helpful hint.

As always, it would be good and helpful if you can re-run the test
with the latest upstream kernel, e.g 3.9-rc, and anyway, I added Eric
who might have some insight on the matter.

Or.


After some hard lessons in understanding SKBs I may finally have
found the reason for the unnecessary memcpy operations. Even with
the newest 3.9-rc5 kernel the problem persists. IPoIB creates only
fragmented SKBs without a single byte in the linear data part. Some
debug messages during GRO handling showed:

skb->len         = 1988 (total data)
skb->data_len    = 1988 (paged data)
skb_headlen(skb) = 0    (non paged data)

inet_gro_receive() requires the IP header inside the SKB. So it
pulls missing data from fragments. This process requires extra
memcpy operations. 
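
The relation between the three values can be modelled in a few lines. This is a user-space sketch for illustration, not kernel code: struct skb_lengths is a made-up stand-in for the two sk_buff fields, and linear_len uses the same arithmetic as skb_headlen(), namely len minus data_len.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the two sk_buff length fields quoted above. */
struct skb_lengths {
    unsigned int len;      /* total data */
    unsigned int data_len; /* paged data held in fragments */
};

/* Same arithmetic as skb_headlen(): bytes in the linear part. */
unsigned int linear_len(const struct skb_lengths *s)
{
    assert(s != NULL && s->data_len <= s->len);
    return s->len - s->data_len;
}

/* Bytes that have to be copied out of the fragments so that `hdr_len`
 * header bytes are available in the linear part. */
unsigned int bytes_to_pull(const struct skb_lengths *s, unsigned int hdr_len)
{
    unsigned int head = linear_len(s);

    return (head >= hdr_len) ? 0 : hdr_len - head;
}
```

With len and data_len both at 1988 the linear part is empty, so a 20 byte IPv4 header has to be copied in full; with 128 bytes already in the linear part nothing needs to be pulled.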

It all comes from ipoib_ud_need_sg() that determines if a
receive block will fit into a single page. Whenever this
function is called the one and only parameter is max_ib_mtu
of the device. In my case with a ConnectX card this defaults
to 4K no matter what MTU is really set. As a result IPoIB will
always create a separate SKB fragment for the incoming data.

My old but nicely working switch only allows an MTU of 2044 bytes.
So I assumed that I do not need to care about fragments and
hardcoded priv->max_ib_mtu to 3072. Pages are sufficiently large
for this MTU. A quick test afterwards, without any claim to rigour,
showed the expected effects.

1) no more additional memcpy operations
2) netperf throughput raised from ~ 5.3GBit to ~ 5.8GBit
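
Why 3072 avoids the fragment while 4K does not can be seen from the page-fit check. The sketch below is a user-space model for illustration, assuming 4 KiB pages and a receive buffer of IB MTU plus the 40 byte GRH; the names ud_buf_size and ud_need_sg are hypothetical stand-ins for the driver helpers mentioned above, not their actual source.

```c
#include <assert.h>

enum {
    IB_GRH_BYTES = 40,
    PAGE_SIZE_4K = 4096 /* assumption: 4 KiB pages */
};

/* Buffer needed for one UD receive: payload up to the IB MTU plus GRH. */
unsigned int ud_buf_size(unsigned int ib_mtu)
{
    return ib_mtu + IB_GRH_BYTES;
}

/* Model of the decision described in the mail: scatter/gather is
 * needed when the receive buffer does not fit into a single page. */
int ud_need_sg(unsigned int max_ib_mtu)
{
    assert(max_ib_mtu > 0);
    return ud_buf_size(max_ib_mtu) > PAGE_SIZE_4K;
}
```

A maximum MTU of 4096 needs 4136 bytes and therefore a second buffer, whereas 3072 and 2048 both fit into one page.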

I hope that I'm not totally wrong with this finding and my
simple explanation is conclusive. Maybe someone with more
knowledge about all this can assist me to get an official
patch into the RDMA development tree?

Thanks in advance.

Markus




IPoIB - GRO forces memcpy inside __pskb_pull_tail

2013-04-02 Thread Markus Stockhausen
Hello,
 
today I did some IPoIB profiling on one of our infiniband servers.
Environment on server side is

- Kernel:  3.5.0-26-generic #42~precise1-Ubuntu
- Mellanox Technologies MT26418 (LnkSta: Speed 2.5GT/s, Width x8)
- Infiniband MTU 2044 (cannot increase to 4K because of old switch)
- one 4 core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz
 
With different client machines I executed a netperf load test.
 
- server side: netserver -p 12345
- client side: netperf -H server_ip -p 12345 -l 120
 
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ...
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

87380  16384   16384    120.00   5078.92
 
Analysis was performed on the server side with
 
- perf record -a -g sleep 10
- perf report
 
The result starts with:
 
# Overhead Symbol
#   .
#
19.67%  [k] copy_user_generic_string
|
|--99.74%-- skb_copy_datagram_iovec
|  tcp_recvmsg
|  inet_recvmsg
|  sock_recvmsg
|  sys_recvfrom
|  system_call_fastpath
|  recv
|  |
|  |--50.17%-- 0x7074656e00667265
|  |
|   --49.83%-- 0x6672657074656e
 --0.26%-- [...]
 7.38%  [k] memcpy
|
|--84.56%-- __pskb_pull_tail
|  |
|  |--81.88%-- pskb_may_pull.part.6
|  |  skb_gro_header_slow
|  |  inet_gro_receive
|  |  dev_gro_receive
|  |  napi_gro_receive
|  |  ipoib_ib_handle_rx_wc
|  |  ipoib_poll
|  |  net_rx_action
|  |  __do_softirq
 
If I get it right, roughly 6% (7.38% * 84.56%) of the time the machine
does a memcpy inside __pskb_pull_tail. The comment on this function
reads "... it expands header moving its tail forward and copying
necessary data from fragmented part. ... It is pretty complicated.
Luckily, it is called only in exceptional cases ..."
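
The 6% figure comes from multiplying the nested perf percentages, since each child entry is relative to its parent. A minimal sketch of that arithmetic, added for illustration with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Multiply a chain of perf percentages (outermost first), where each
 * entry is relative to its parent, into one absolute percentage. */
double absolute_share(const double *pcts, size_t n)
{
    assert(pcts != NULL && n > 0);
    double share = 100.0;

    for (size_t i = 0; i < n; i++)
        share *= pcts[i] / 100.0;
    return share;
}
```

The chain 7.38 and 84.56 gives about 6.24% for __pskb_pull_tail; extending it with 81.88 gives about 5.11% for the path through pskb_may_pull and the GRO receive functions.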
 
That does not sound good at all. I repeated the test on a normal Intel
gigabit network without jumbo frames and __pskb_pull_tail was not in
the top consumer list.

Does anyone have an idea if this is normal GRO behaviour for IPoIB? At
the moment I have a full test environment and could implement and
verify some kernel corrections if someone could give a helpful hint.
 
Thanks in advance.
 
Markus
 


