Re: Determine memory bandwidth machine

2018-01-16 Thread Matt P. Dziubinski
On Sunday, January 14, 2018 at 7:44:00 PM UTC+1, Peter Veentjer wrote:
>
>
> What is the best tool to determine the maximum bandwidth of a machine 
> running Linux (RHEL 7)
>

Intel Memory Latency Checker (MLC) is an option (despite the name, it can 
also be used to measure bandwidth):
https://www.intel.com/software/mlc

See also:
Memory Latency and NUMA: 
http://www.qdpma.com/ServerSystems/MemoryLatency.html
http://frankdenneman.nl/2015/02/27/memory-deep-dive-numa-data-locality/
http://frankdenneman.nl/2016/07/13/numa-deep-dive-4-local-memory-optimization/
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600141
http://techblog.cloudperf.net/2016/09/exploring-numa-on-amazon-cloud-instances.html#numapolicy

Best,
Matt

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Determine memory bandwidth machine

2018-01-16 Thread Gil Tene
Good point about the NUMA part. My dd example may well end up allocating 
memory for the file in /tmp on one socket or another, causing all the reads 
to hit that one socket's memory channels.

For a good spread, a C program will probably be best. Allocate memory from 
the local NUMA node, and run multiple threads reading that memory while 
timing the reads. Then run one copy of this program bound to each socket 
(with numactl -N) and sum up the numbers (for an idealized max memory 
bandwidth figure that involves no cross-socket access).
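
To make the per-socket binding a bit more concrete, here is a minimal Python sketch of what numactl -N arranges on the CPU side. It assumes the Linux sysfs layout /sys/devices/system/node/nodeN/cpulist and uses os.sched_setaffinity; the helper names (parse_cpulist, node_cpus, pin_to_node, total_bandwidth) are made up for illustration. It only pins CPUs, and relies on the kernel's default local-allocation policy for memory placement, which is an assumption about how the machine is configured.

```python
import os

def parse_cpulist(text):
    """Parse a sysfs cpulist string such as '0-9,20-29' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def node_cpus(node):
    """Return the CPUs of one NUMA node (assumes the Linux sysfs layout)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())

def pin_to_node(node):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    os.sched_setaffinity(0, node_cpus(node))

def total_bandwidth(per_socket_gb_s):
    """Sum independently measured per-socket figures into one aggregate."""
    return sum(per_socket_gb_s)
```

A benchmark process would call pin_to_node(0) or pin_to_node(1) before allocating and touching its buffer, and the per-socket results would then be added up with total_bandwidth.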

On Tuesday, January 16, 2018 at 10:54:22 AM UTC-8, Chet L wrote:
>
> Agree with Gil (w.r.t. DIMM slots, channels etc.) but maybe not on the 'dd' 
> script. Here's why. Some BIOSes have the 'memory interleaving' option turned 
> OFF. And if the OS-wide memory policy is non-interleaving too, then in that 
> case unless your application explicitly binds memory to a remote socket you 
> cannot interleave memory. Or you would need to use NUMA tools (to set the 
> memory policy etc.) while launching your application.
>
> Bandwidth or latency monitoring is also dependent on the workload you are 
> running. If the workload a) runs atomics and b) is running on Node-0 only 
> (memory is local or striped), then the snoop responses are going to take 
> longer (uncore frequency etc.) because socket-1 might be in power-saving 
> states. So ideally you would need a 'snoozer' thread on the remote 
> socket (Node-1) which would prevent the socket from entering one of the 'C' 
> (or whatever) states (or you can disable hardware power-saving modes, but 
> you may need to line up all the options because the kernel may have 
> power-saving options too). If you use industry-standard tools like 
> 'stream' (as others mentioned), they will do all of this for you (via 
> dummy/snoozer threads and so on).
>
> If you want to write this all by yourself then you should know the NUMA 
> APIs (http://man7.org/linux/man-pages/man3/numa.3.html) and the NUMA tools 
> (numactl, taskset). For latency measurements you should also disable 
> prefetching, otherwise it will give you super awesome numbers.
>
> Note: In future, if you use a device driver that does the allocation for 
> your app then you should make sure the driver knows about NUMA allocation 
> too and aligns everything for you. I found that out the hard way back in 
> 2010 and realized that the Linux kernel had no guidance for PCIe 
> drivers (back then ... it's OK now). I fixed it locally at the time.
>
> Hope this helps.
>
> Chetan Loke
>
>
> On Monday, January 15, 2018 at 11:19:53 AM UTC-5, Kevin Bowling wrote:
>>
>> lmbench works well http://www.bitmover.com/lmbench/man_lmbench.html, 
>> and Larry seems happy to answer questions on building/using it. 
>>
>> Unless you've explicitly built an application to work with NUMA, or 
>> are able to run two copies of an application pinned to each domain, 
>> you really only will get about 1 package worth of BW, and latency is a 
>> bigger deal (which lmbench can also measure in cooperation with 
>> numactl) 
>>
>> Regards, 
>>
>> On Sun, Jan 14, 2018 at 11:44 AM, Peter Veentjer wrote: 
>> > I'm working on some very simple aggregations on huge chunks of offheap 
>> > memory (500GB+) for a hackathon. This is done using a very simple stride; 
>> > every iteration the address increases by 20 bytes. So the prefetcher 
>> > should not have any problems with it. 
>> > 
>> > According to my calculations I'm currently processing 35 GB/s. However 
>> > I'm not sure if I'm close to the maximum bandwidth of this machine. Specs: 
>> > 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P 
>> > 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket 
>> > 
>> > What is the best tool to determine the maximum bandwidth of a machine 
>> > running Linux (RHEL 7) 
>> > 
>>
>



Re: Determine memory bandwidth machine

2018-01-16 Thread Chet L
Agree with Gil (w.r.t. DIMM slots, channels etc.) but maybe not on the 'dd' 
script. Here's why. Some BIOSes have the 'memory interleaving' option turned 
OFF. And if the OS-wide memory policy is non-interleaving too, then in that 
case unless your application explicitly binds memory to a remote socket you 
cannot interleave memory. Or you would need to use NUMA tools (to set the 
memory policy etc.) while launching your application.

Bandwidth or latency monitoring is also dependent on the workload you are 
running. If the workload a) runs atomics and b) is running on Node-0 only 
but with memory spread across both nodes, then the snoop responses are going 
to take longer because socket-1 might be in power-saving states (uncore 
frequency etc.). So ideally you would need a 'snoozer' thread on the remote 
socket (Node-1) which would prevent the socket from entering one of the 'C' 
(or whatever) states (or you can disable hardware power-saving modes, but 
you may need to line up all the options because the kernel may have 
power-saving options too). If you use industry-standard tools like 'stream' 
(as others mentioned), they will do all of this for you (via dummy/snoozer 
threads and so on).

If you want to write this all by yourself then you should know the NUMA 
APIs (http://man7.org/linux/man-pages/man3/numa.3.html) and the NUMA tools 
(numactl, taskset). For latency measurements you should also disable 
prefetching, otherwise it will give you super awesome numbers.
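
Where the prefetchers cannot be switched off, a different and commonly used workaround (not the one described above) is to make every load depend on the previous one and to visit the buffer in random order, so the hardware has no pattern to follow. Below is a hedged sketch of building such a chain with Sattolo's algorithm. The names build_chain and walk are illustrative, and in Python the interpreter overhead dominates, so this only shows the data layout that a C loop would traverse, not a usable timing.

```python
import random

def build_chain(n, seed=0):
    """Return next_index such that following it from 0 visits all n slots once.

    Sattolo's algorithm yields a permutation made of a single cycle, so the
    walk cannot get stuck in a short loop that fits in cache.
    """
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index

def walk(next_index, steps):
    """Follow the chain for `steps` dependent loads and return the end slot."""
    pos = 0
    for _ in range(steps):
        pos = next_index[pos]
    return pos
```

In a real measurement the chain would be much larger than the last-level cache, and the average time per step would approximate the load-to-use latency.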

Note: In future, if you use a device driver that does the allocation for 
your app then you should make sure the driver knows about NUMA allocation 
too and aligns everything for you. I found that out the hard way back in 
2010 and realized that the Linux kernel had no guidance for PCIe 
drivers (back then ... it's OK now). I fixed it locally at the time.

Hope this helps.

Chetan Loke


On Monday, January 15, 2018 at 11:19:53 AM UTC-5, Kevin Bowling wrote:
>
> lmbench works well http://www.bitmover.com/lmbench/man_lmbench.html, 
> and Larry seems happy to answer questions on building/using it. 
>
> Unless you've explicitly built an application to work with NUMA, or 
> are able to run two copies of an application pinned to each domain, 
> you really only will get about 1 package worth of BW, and latency is a 
> bigger deal (which lmbench can also measure in cooperation with 
> numactl) 
>
> Regards, 
>
> On Sun, Jan 14, 2018 at 11:44 AM, Peter Veentjer wrote: 
> > I'm working on some very simple aggregations on huge chunks of offheap 
> > memory (500GB+) for a hackathon. This is done using a very simple stride; 
> > every iteration the address increases by 20 bytes. So the prefetcher 
> > should not have any problems with it. 
> > 
> > According to my calculations I'm currently processing 35 GB/s. However 
> > I'm not sure if I'm close to the maximum bandwidth of this machine. Specs: 
> > 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P 
> > 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket 
> > 
> > What is the best tool to determine the maximum bandwidth of a machine 
> > running Linux (RHEL 7) 
> > 
>



Re: Determine memory bandwidth machine

2018-01-15 Thread Kevin Bowling
lmbench works well http://www.bitmover.com/lmbench/man_lmbench.html,
and Larry seems happy to answer questions on building/using it.

Unless you've explicitly built an application to work with NUMA, or
are able to run two copies of an application pinned to each domain,
you really only will get about 1 package worth of BW, and latency is a
bigger deal (which lmbench can also measure in cooperation with
numactl)

Regards,

On Sun, Jan 14, 2018 at 11:44 AM, Peter Veentjer  wrote:
> I'm working on some very simple aggregations on huge chunks of offheap
> memory (500GB+) for a hackathon. This is done using a very simple stride;
> every iteration the address increases by 20 bytes. So the prefetcher
> should not have any problems with it.
>
> According to my calculations I'm currently processing 35 GB/s. However I'm
> not sure if I'm close to the maximum bandwidth of this machine. Specs:
> 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P
> 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
>
> What is the best tool to determine the maximum bandwidth of a machine
> running Linux (RHEL 7)
>



Re: Determine memory bandwidth machine

2018-01-15 Thread Holger Hoffstätte
On 01/14/18 19:44, Peter Veentjer wrote:
> I'm working on some very simple aggregations on huge chunks of
> offheap memory (500GB+) for a hackathon. This is done using a very
> simple stride; every iteration the address increases by 20 bytes.
> So the prefetcher should not have any problems with it.
> 
> According to my calculations I'm currently processing 35 GB/s.
> However I'm not sure if I'm close to the maximum bandwidth of this
> machine. Specs: 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P; 2x Intel(R)
> Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
> 
> What is the best tool to determine the maximum bandwidth of a machine
> running Linux (RHEL 7)

I recently had the same question (out of curiosity, after reading
about Ryzen/EPYC memory performance) and still had my bookmarks,
so here goes.

- The 'perf' utility usually used for performance measurements has
a memory benchmark. Somewhat fiddly with its parameters but OK for a
quick test. Single-threaded only and you really need to pass larger
memory blocks, otherwise you might only get cache bandwidth.

- 'mbw' [1] is also single-threaded only, but quick & easy to run.
Make sure to pass proper CFLAGS, otherwise it will build without
any optimization at all.

- 'pmbw' [2] is a parallel version of mbw with assembly loops, SSE/AVX
and many variants of accesses (forwards, backwards, sideways ;).
Unfortunately it has completely unreadable output; this is offset by
the built-in capability to pass the output to gnuplot and make pretty
pictures. Also has pretty extensive benchmark results on the website.

- The "industry-standard" bandwidth benchmark is STREAM [3] by
John McCalpin of SGI and comp.arch fame. Unfortunately the original code
has been hacked on by various people, so different versions float around
more or less unmaintained. I found two forks that are easy to use:

- [4] is a cleaned-up version with optional OpenMP support that
should build out of the box. Just get stream.c and build, with or
without OpenMP. You REALLY need to pass much higher values for
STREAM_ARRAY_SIZE (at least ~50-80x) and NTIMES (~10x), otherwise
the run will be too short & meaningless on your machine.

- [5] is another fork with NUMA support. This is relevant
because you have two sockets and are probably running without
NUMA affinity, effectively trashing your caches not just from the
local CPU but also from the other, just like real applications
without NUMA awareness tend to do. :(

In any case make sure to build STREAM with -O3 -march=native.
Pass -fopenmp to get default OpenMP support. The NUMA fork has both
an OpenMP version and a version with "manual threading" and explicit
NUMA awareness.
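
As a rough cross-check on the array size: STREAM's own run rules ask for each array to be at least four times the combined last-level cache. A small sketch of that arithmetic follows. The 25 MB of L3 per socket is the published figure for the E5-2687W v3, the 10,000,000-element default is the one in stream.c, and the function name is illustrative. The 4x rule is only a floor; the larger 50-80x suggestion mainly buys longer, more stable runs.

```python
def min_stream_array_size(llc_bytes_per_socket, sockets, bytes_per_element=8, factor=4):
    """Smallest STREAM_ARRAY_SIZE (in elements) satisfying the 4x-cache rule."""
    total_cache = llc_bytes_per_socket * sockets
    return factor * total_cache // bytes_per_element

DEFAULT_STREAM_ARRAY_SIZE = 10_000_000   # default in stream.c
L3_PER_SOCKET = 25 * 1000 * 1000         # 25 MB, E5-2687W v3 (assumed decimal MB)

minimum = min_stream_array_size(L3_PER_SOCKET, sockets=2)
suggested_low = 50 * DEFAULT_STREAM_ARRAY_SIZE
suggested_high = 80 * DEFAULT_STREAM_ARRAY_SIZE

print(f"4x-cache minimum : {minimum:,} elements")
print(f"50x-80x default  : {suggested_low:,} to {suggested_high:,} elements")
# STREAM allocates three arrays of doubles.
print(f"memory at 80x    : {3 * 8 * suggested_high / 1e9:.1f} GB")
```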

Happy benchmarking!

Holger

[1] https://github.com/raas/mbw
[2] https://github.com/bingmann/pmbw
[3] http://www.cs.virginia.edu/stream/
[4] https://github.com/jeffhammond/STREAM
[5] https://github.com/larsbergstrom/NUMA-STREAM



Re: Determine memory bandwidth machine

2018-01-14 Thread Gil Tene
Per e.g. the Intel ARK entry for the E5-2687W v3, the theoretical memory 
bandwidth of a single E5-2687W v3 socket is 68GB/sec. And for a 2 socket 
system that would be 136GB/sec. Achieving that full theoretical bandwidth 
would require the right alignment of stars though, including the right DIMM 
population, number of ranks, access pattern within each channel, and full 
use of all memory channels simultaneously.

I'd say that if you see 68GB/sec (50% of max theoretical) you are probably 
in fairly good shape.
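
The 68GB/sec figure can be reproduced from the DIMM speed. Here is a quick sketch of the arithmetic, assuming four DDR4 channels per socket (the E5-2687W v3 spec), 8 bytes per transfer on a 64-bit channel, and DIMMs that really run at 2133 MT/s in this configuration. The function name is just for illustration.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak of one socket in GB/s (decimal gigabytes)."""
    return mega_transfers_per_s * 1e6 * channels * bytes_per_transfer / 1e9

per_socket = peak_bandwidth_gb_s(2133, channels=4)   # DDR4-2133, 4 channels
both_sockets = 2 * per_socket

print(f"per socket: {per_socket:.1f} GB/s, two sockets: {both_sockets:.1f} GB/s")
for measured in (35, 55, 68):                        # figures reported in this thread
    print(f"{measured} GB/s is {100 * measured / both_sockets:.0f}% of peak")
```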

If you want to empirically test this, I'd do something like repeated (dd 
bs=2M if=/tmp/oneGigFile of=/dev/null) or equivalent on multiple shells 
simultaneously (make a 1 gig file with mkfile -n 1g /tmp/oneGigFile, or with 
dd if=/dev/zero of=/tmp/oneGigFile bs=1M count=1024 where mkfile is not 
available), and grow the number of shells until you see a peak in aggregate 
reported throughput... You can obviously write a short C program to do this 
as well...
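
A rough Python sketch of the same experiment, for anyone who prefers a script to a pile of shells: several readers pull the same file through the page cache in 2 MiB blocks, and the aggregate rate is reported per reader count. Threads are used because CPython releases the GIL during the read call. The function names are illustrative, and like the dd approach this measures copies out of the page cache rather than raw DRAM reads, so treat the numbers as a lower bound.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_file_once(path, block_size=2 * 1024 * 1024):
    """Read the whole file in block_size chunks; return the bytes read."""
    buf = bytearray(block_size)
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:                 # 0 at end of file
                break
            total += n
    return total

def aggregate_throughput(path, readers, passes=1):
    """Run `readers` concurrent readers; return aggregate bytes per second."""
    def worker(_):
        return sum(read_file_once(path) for _ in range(passes))
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=readers) as pool:
        total = sum(pool.map(worker, range(readers)))
    elapsed = time.perf_counter() - start
    return total / elapsed

def sweep(path, max_readers):
    """Return [(readers, GB/s)] for 1..max_readers concurrent readers."""
    return [(n, aggregate_throughput(path, n) / 1e9)
            for n in range(1, max_readers + 1)]
```

Calling sweep("/tmp/oneGigFile", 20) after one warm-up read and looking for the reader count where the GB/s column stops growing mirrors the "grow the number of shells" step.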

On Sunday, January 14, 2018 at 8:20:07 PM UTC-8, Peter Veentjer wrote:
>
> I discovered that the system wasn't effectively utilizing the CPUs. 
> Initially I thought it was caused by CPU throttling due to overheating, but 
> after reading out the temperatures, this hypothesis turned out to be wrong.
>
> I have increased the number of threads that generate requests. The sawtooth 
> pattern in the CPU load (forking and joining) has disappeared and the load 
> remains constant at roughly 80%. I'm currently aggregating at 55 GB/s. 
>
> I also played with a 50GB offheap chunk and I'm up to 68 GB/s. 
>
> It would still be interesting to know if there is a tool that can show the 
> maximum bandwidth of the memory bus.
>
> On Sunday, January 14, 2018 at 8:44:00 PM UTC+2, Peter Veentjer wrote:
>>
>> I'm working on some very simple aggregations on huge chunks of offheap 
>> memory (500GB+) for a hackathon. This is done using a very simple stride; 
>> every iteration the address increases by 20 bytes. So the prefetcher 
>> should not have any problems with it.
>>
>> According to my calculations I'm currently processing 35 GB/s. However 
>> I'm not sure if I'm close to the maximum bandwidth of this machine. Specs:
>> 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P
>> 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
>>
>> What is the best tool to determine the maximum bandwidth of a machine 
>> running Linux (RHEL 7)
>>
>



Re: Determine memory bandwidth machine

2018-01-14 Thread Peter Veentjer
I discovered that the system wasn't effectively utilizing the CPUs. 
Initially I thought it was caused by CPU throttling due to overheating, but 
after reading out the temperatures, this hypothesis turned out to be wrong.

I have increased the number of threads that generate requests. The sawtooth 
pattern in the CPU load (forking and joining) has disappeared and the load 
remains constant at roughly 80%. I'm currently aggregating at 55 GB/s. 

I also played with a 50GB offheap chunk and I'm up to 68 GB/s. 

It would still be interesting to know if there is a tool that can show the 
maximum bandwidth of the memory bus.

On Sunday, January 14, 2018 at 8:44:00 PM UTC+2, Peter Veentjer wrote:
>
> I'm working on some very simple aggregations on huge chunks of offheap 
> memory (500GB+) for a hackathon. This is done using a very simple stride; 
> every iteration the address increases by 20 bytes. So the prefetcher 
> should not have any problems with it.
>
> According to my calculations I'm currently processing 35 GB/s. However I'm 
> not sure if I'm close to the maximum bandwidth of this machine. Specs:
> 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P
> 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
>
> What is the best tool to determine the maximum bandwidth of a machine 
> running Linux (RHEL 7)
>



Re: Determine memory bandwidth machine

2018-01-14 Thread Peter Veentjer
Some additional information.

The memory is broken up into chunks of 8MB, which are processed in parallel 
using the fork/join framework. So all cores are busy iterating over the 
memory.
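
A sketch of how such a chunked scan might be laid out, in Python rather than the Java fork/join code actually used (which is not shown in this thread). It assumes 20-byte records, as in the original post, and rounds the 8MB chunk down to a whole number of records so that no record straddles two chunks; whether the real code does that is not stated. All names are illustrative, and CPython threads will not scale this loop, so it shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

RECORD_SIZE = 20                  # bytes per record (from the original post)
CHUNK_TARGET = 8 * 1024 * 1024    # 8MB, assumed here to mean 8 MiB

def chunk_ranges(total_bytes, chunk_target=CHUNK_TARGET, record_size=RECORD_SIZE):
    """Split [0, total_bytes) into chunks that hold whole records only."""
    chunk = (chunk_target // record_size) * record_size
    return [(start, min(start + chunk, total_bytes))
            for start in range(0, total_bytes, chunk)]

def sum_first_byte(buf, start, end, record_size=RECORD_SIZE):
    """Toy aggregation: add up the first byte of every record in one chunk."""
    return sum(buf[start:end:record_size])

def parallel_aggregate(buf, workers=4):
    """Aggregate all chunks on a thread pool and combine the partial sums."""
    ranges = chunk_ranges(len(buf))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda r: sum_first_byte(buf, *r), ranges)
        return sum(partials)
```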

On Sunday, January 14, 2018 at 8:44:00 PM UTC+2, Peter Veentjer wrote:
>
> I'm working on some very simple aggregations on huge chunks of offheap 
> memory (500GB+) for a hackathon. This is done using a very simple stride; 
> every iteration the address increases by 20 bytes. So the prefetcher 
> should not have any problems with it.
>
> According to my calculations I'm currently processing 35 GB/s. However I'm 
> not sure if I'm close to the maximum bandwidth of this machine. Specs:
> 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P
> 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
>
> What is the best tool to determine the maximum bandwidth of a machine 
> running Linux (RHEL 7)
>



Determine memory bandwidth machine

2018-01-14 Thread Peter Veentjer
I'm working on some very simple aggregations on huge chunks of offheap 
memory (500GB+) for a hackathon. This is done using a very simple stride; 
every iteration the address increases by 20 bytes. So the prefetcher 
should not have any problems with it.
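
One consequence of a 20-byte stride is worth spelling out: since it is smaller than a cache line (64 bytes on this Xeon), every line of the region is pulled in from memory no matter how few bytes of each record the aggregation actually looks at. A small sketch that counts the lines touched; the 4-byte access width and the function name are assumptions for illustration.

```python
CACHE_LINE = 64   # bytes; the line size on current x86 CPUs

def lines_touched(region_bytes, stride, access_bytes, line=CACHE_LINE):
    """Count distinct cache lines hit by accesses at 0, stride, 2*stride, ..."""
    touched = set()
    for addr in range(0, region_bytes - access_bytes + 1, stride):
        first = addr // line
        last = (addr + access_bytes - 1) // line
        touched.update(range(first, last + 1))
    return len(touched)

region = 64 * 1000                       # 1000 cache lines
print(lines_touched(region, stride=20, access_bytes=4))    # every line is hit
print(lines_touched(region, stride=256, access_bytes=4))   # one line in four
```

So for this access pattern the memory traffic is essentially the size of the region, and bytes scanned divided by elapsed time is a fair estimate of the bandwidth actually consumed.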

According to my calculations I'm currently processing 35 GB/s. However I'm 
not sure if I'm close to the maximum bandwidth of this machine. Specs:
2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P
2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket

What is the best tool to determine the maximum bandwidth of a machine 
running Linux (RHEL 7)?
