I'm not sure whether I should apologize for the noise, but it turns out
the problem was simply a NUMA issue!
My system is a 2-node NUMA system and the IB board is attached to node 0.
Without any CPU/memory affinity, the code apparently always ends up
running on the worst node!
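
(For reference, this is roughly how the topology can be checked; the
device name mlx4_0 below is just a placeholder for whatever your HCA is
called, so adjust it to your system:)

$ numactl --hardware                                  # lists both NUMA nodes and their CPUs/memory
$ cat /sys/class/infiniband/mlx4_0/device/numa_node   # "mlx4_0" is only an example name; prints 0 here, i.e. node 0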

Without affinity (I ran this several times):

$ rstream -s 10.30.3.2
name      bytes   xfers   iters   total       time     Gb/sec    usec/xfer
64_lat    64      1       1m      122m        4.36s      0.23       2.18
4k_lat    4k      1       100k    781m        2.42s      2.70      12.12
64k_lat   64k     1       10k     1.2g        1.70s      6.17      84.92
1m_lat    1m      1       100     200m        0.26s      6.53    1284.81
64_bw     64      1m      1       122m        1.38s      0.74       0.69
4k_bw     4k      100k    1       781m        1.02s      6.44       5.09
64k_bw    64k     10k     1       1.2g        1.54s      6.82      76.93
1m_bw     1m      100     1       200m        0.25s      6.61    1268.28

Affinity on node 1 (the worst one):

$ numactl --membind=1 --cpunodebind=1 rstream -s 10.30.3.2
name      bytes   xfers   iters   total       time     Gb/sec    usec/xfer
64_lat    64      1       1m      122m        4.36s      0.23       2.18
4k_lat    4k      1       100k    781m        2.42s      2.70      12.11
64k_lat   64k     1       10k     1.2g        1.70s      6.18      84.90
1m_lat    1m      1       100     200m        0.26s      6.53    1284.71
64_bw     64      1m      1       122m        1.38s      0.74       0.69
4k_bw     4k      100k    1       781m        1.02s      6.44       5.09
64k_bw    64k     10k     1       1.2g        1.54s      6.82      76.91
1m_bw     1m      100     1       200m        0.25s      6.61    1269.56

Affinity on node 0:

$ numactl --membind=0 --cpunodebind=0 rstream -s 10.30.3.2
name      bytes   xfers   iters   total       time     Gb/sec    usec/xfer
64_lat    64      1       1m      122m        3.81s      0.27       1.90
4k_lat    4k      1       100k    781m        1.88s      3.49       9.39
64k_lat   64k     1       10k     1.2g        1.10s      9.56      54.82
1m_lat    1m      1       100     200m        0.15s     11.41     735.00
64_bw     64      1m      1       122m        0.92s      1.11       0.46
4k_bw     4k      100k    1       781m        0.59s     11.07       2.96
64k_bw    64k     10k     1       1.2g        0.89s     11.73      44.69
1m_bw     1m      100     1       200m        0.14s     11.70     716.98

Since this is RDMA, the most important affinity is the memory one.
With the affinity in place, even the custom test now runs as expected:

$ numactl --membind=0 --cpunodebind=0 rstream -s 10.30.3.2 -S 6291456
name      bytes   xfers   iters   total       time     Gb/sec    usec/xfer
custom    6m      1k      1       11g         8.91s     11.29    4456.94

i.e. 1445.12 MB/sec using rstream
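
(Since memory placement should be the dominant factor here, it is
probably also worth trying to bind only the memory and leave the CPUs
alone; I have not isolated it that way yet, just a thought:)

$ numactl --membind=0 rstream -s 10.30.3.2 -S 6291456    # memory binding only, no --cpunodebind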

Now even my application, which is not using rsocket (yet), reaches
1200 MB/sec. Is my OS to blame?

What is strange is that ib_write_bw does not seem to be affected by
affinity at all: no matter which node I bind it to, it still reports
1500 MB/s. I had a quick look at the code and it does not seem to set
any CPU affinity on its own.
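
(What I tried was essentially pinning both ends with numactl, along these
lines; the ib_write_bw invocations are from memory, so double-check the
options against your perftest version:)

$ numactl --membind=1 --cpunodebind=1 ib_write_bw              # server side
$ numactl --membind=1 --cpunodebind=1 ib_write_bw 10.30.3.2    # client side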

Maybe rsocket should bind its memory to the node the IB board is
attached to before allocating its own buffers?
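
(Until/unless rsocket does that internally, a possible workaround sketch,
assuming the sysfs path above and that mlx4_0 is the right device, is to
read the node from the HCA itself instead of hard-coding it:)

$ IB_NODE=$(cat /sys/class/infiniband/mlx4_0/device/numa_node)   # "mlx4_0" is an assumed device name
$ numactl --membind=$IB_NODE --cpunodebind=$IB_NODE rstream -s 10.30.3.2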

Gaetano


On Tue, Aug 28, 2012 at 8:42 PM, Hefty, Sean <sean.he...@intel.com> wrote:
>> $ ./examples/rstream -s 10.30.3.2 -S all
>> name      bytes   xfers   iters   total       time     Gb/sec    usec/xfer
>> 16k_lat   16k     1       10k     312m        0.52s      5.06      25.93
>> 24k_lat   24k     1       10k     468m        0.82s      4.79      41.08
>> 32k_lat   32k     1       10k     625m        0.91s      5.76      45.51
>> 48k_lat   48k     1       10k     937m        1.50s      5.26      74.82
>> 64k_lat   64k     1       10k     1.2g        1.74s      6.04      86.77
>> 96k_lat   96k     1       10k     1.8g        2.45s      6.42     122.52
>> 128k_lat  128k    1       1k      250m        0.33s      6.38     164.35
>> 192k_lat  192k    1       1k      375m        0.56s      5.66     277.78
>> 256k_lat  256k    1       1k      500m        0.65s      6.42     326.71
>> 384k_lat  384k    1       1k      750m        0.85s      7.43     423.59
>> 512k_lat  512k    1       1k      1000m       1.28s      6.55     640.76
>> 768k_lat  768k    1       1k      1.4g        2.15s      5.86    1072.87
>> 1m_lat    1m      1       100     200m        0.30s      5.54    1514.93
>> 1.5m_lat  1.5m    1       100     300m        0.26s      9.54    1319.66
>> 2m_lat    2m      1       100     400m        0.60s      5.60    2993.67
>> 3m_lat    3m      1       100     600m        0.90s      5.58    4509.93
>> 4m_lat    4m      1       100     800m        1.20s      5.57    6023.30
>> 6m_lat    6m      1       100     1.1g        1.00s     10.10    4982.83
>> 16k_bw    16k     10k     1       312m        0.39s      6.74      19.45
>> 24k_bw    24k     10k     1       468m        0.71s      5.53      35.56
>> 32k_bw    32k     10k     1       625m        0.95s      5.53      47.42
>> 48k_bw    48k     10k     1       937m        1.42s      5.55      70.91
>> 64k_bw    64k     10k     1       1.2g        1.89s      5.55      94.44
>> 96k_bw    96k     10k     1       1.8g        2.83s      5.56     141.43
>> 128k_bw   128k    1k      1       250m        0.38s      5.56     188.60
>> 192k_bw   192k    1k      1       375m        0.57s      5.57     282.62
>> 256k_bw   256k    1k      1       500m        0.65s      6.50     322.76
>> 384k_bw   384k    1k      1       750m        1.13s      5.58     563.75
>> 512k_bw   512k    1k      1       1000m       1.50s      5.58     751.58
>> 768k_bw   768k    1k      1       1.4g        2.26s      5.57    1129.26
>> 1m_bw     1m      100     1       200m        0.16s     10.24     819.18
>
> I think there's something else going on.  There really shouldn't be huge 
> jumps in the bandwidth like this.
>
> I don't know if this indicates a problem with the HCA (is the firmware up to 
> date?), the switch, the PCI bus, the chipset, or what.  What is your 
> performance running the client and server on the same system?
>
>> 1.5m_bw   1.5m    100     1       300m        0.45s      5.61    2241.51
>> 2m_bw     2m      100     1       400m        0.60s      5.59    3001.57
>> 3m_bw     3m      100     1       600m        0.90s      5.57    4515.06
>> 4m_bw     4m      100     1       800m        0.65s     10.34    3245.21
>> 6m_bw     6m      100     1       1.1g        1.81s      5.56    9046.91
>>
>> Starting with the 48k test, it seems that the maximum (~10 Gb/sec) is reached
>> at 3m:
>>
>> $ ./examples/rstream -b 10.30.3.2 -S all
>> name      bytes   xfers   iters   total       time     Gb/sec    usec/xfer
>> 48k_lat   48k     1       10k     937m        1.40s      5.62      69.96
>> 64k_lat   64k     1       10k     1.2g        1.93s      5.44      96.43
>> 96k_lat   96k     1       10k     1.8g        2.62s      6.01     130.87
>> 128k_lat  128k    1       1k      250m        0.37s      5.62     186.71
>> 192k_lat  192k    1       1k      375m        0.50s      6.33     248.64
>> 256k_lat  256k    1       1k      500m        0.58s      7.22     290.45
>> 384k_lat  384k    1       1k      750m        0.95s      6.62     475.05
>> 512k_lat  512k    1       1k      1000m       1.44s      5.82     721.16
>> 768k_lat  768k    1       1k      1.4g        1.97s      6.38     986.84
>> 1m_lat    1m      1       100     200m        0.19s      8.74     959.41
>> 1.5m_lat  1.5m    1       100     300m        0.44s      5.69    2212.52
>> 2m_lat    2m      1       100     400m        0.60s      5.62    2986.33
>> 3m_lat    3m      1       100     600m        0.90s      5.58    4506.85
>> 4m_lat    4m      1       100     800m        0.68s      9.81    3419.98
>> 6m_lat    6m      1       100     1.1g        1.55s      6.49    7758.06
>> 48k_bw    48k     10k     1       937m        1.16s      6.75      58.22
>> 64k_bw    64k     10k     1       1.2g        1.89s      5.55      94.39
>> 96k_bw    96k     10k     1       1.8g        2.83s      5.56     141.41
>> 128k_bw   128k    1k      1       250m        0.38s      5.58     188.04
>> 192k_bw   192k    1k      1       375m        0.52s      6.01     261.88
>> 256k_bw   256k    1k      1       500m        0.75s      5.57     376.28
>> 384k_bw   384k    1k      1       750m        1.13s      5.58     564.04
>> 512k_bw   512k    1k      1       1000m       1.50s      5.58     752.06
>> 768k_bw   768k    1k      1       1.4g        1.61s      7.80     807.06
>> 1m_bw     1m      100     1       200m        0.30s      5.63    1490.35
>> 1.5m_bw   1.5m    100     1       300m        0.45s      5.60    2248.11
>> 2m_bw     2m      100     1       400m        0.60s      5.58    3005.60
>> 3m_bw     3m      100     1       600m        0.50s      9.98    2522.82
>> 4m_bw     4m      100     1       800m        1.19s      5.62    5971.85
>> 6m_bw     6m      100     1       1.1g        1.80s      5.59    8998.39
>>
>>
>> I don't know exactly what is going on behind the scenes, but it seems that
>> each test depends on what was run before it.
>
> The alignment of the data along cache lines would be different.  I'll be 
> surprised if that makes this large of a difference.
>
> For bandwidth testing, you want a large QP size (sqsize_default and 
> rqsize_default set to 512 or 1024), large send/receive buffers (mem_default 
> and wmem_default set to 1M+), and a small inline data size (inline_default of 
> 16 or 32).  rstream should configure some of these manually, depending on the 
> testing options.  But the performance you're seeing is varying so greatly 
> that I don't think the software is the issue.
>
> - Sean
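
P.S. For the tuning knobs Sean mentions above: as far as I can tell,
librdmacm reads those rsocket defaults from files under /etc/rdma/rsocket
(the path and file names are my reading of rsocket.c, so please correct
me if that is wrong). Something like this, run as root, should set them:

# mkdir -p /etc/rdma/rsocket
# echo 512     > /etc/rdma/rsocket/sqsize_default    # large QP sizes
# echo 512     > /etc/rdma/rsocket/rqsize_default
# echo 1048576 > /etc/rdma/rsocket/mem_default       # 1M+ send/receive buffers
# echo 1048576 > /etc/rdma/rsocket/wmem_default
# echo 16      > /etc/rdma/rsocket/inline_default    # small inline data size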



-- 
cpp-today.blogspot.com