Re: Determine memory bandwidth machine
On Sunday, January 14, 2018 at 7:44:00 PM UTC+1, Peter Veentjer wrote:
> What is the best tool to determine the maximum bandwidth of a machine running Linux (RHEL 7)?

Intel Memory Latency Checker (MLC) is an option (despite the name, it can also be used to measure bandwidth): https://www.intel.com/software/mlc

See also, on memory latency and NUMA:
http://www.qdpma.com/ServerSystems/MemoryLatency.html
http://frankdenneman.nl/2015/02/27/memory-deep-dive-numa-data-locality/
http://frankdenneman.nl/2016/07/13/numa-deep-dive-4-local-memory-optimization/
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600141
http://techblog.cloudperf.net/2016/09/exploring-numa-on-amazon-cloud-instances.html#numapolicy

Best,
Matt

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Determine memory bandwidth machine
Good point about the NUMA part. My dd example may well end up allocating memory for the file in /tmp on one socket or the other, causing all the reads to hit that one socket's memory channels.

For a good spread, a C program will probably be best: allocate memory from the local NUMA node, and run multiple threads reading that memory for speed clocking. Then run one copy of this program bound to each socket (with numactl -N) and sum up the numbers (for an idealized maximum-memory-bandwidth figure that does no cross-socket access).

On Tuesday, January 16, 2018 at 10:54:22 AM UTC-8, Chet L wrote:
> Agree with Gil (w.r.t. DIMM slots, channels etc.) but maybe not on the 'dd' script. [snip]
Re: Determine memory bandwidth machine
Agree with Gil (w.r.t. DIMM slots, channels etc.) but maybe not on the 'dd' script. Here's why: some BIOSes have the 'memory interleaving' option turned OFF. And if the OS-wide memory policy is non-interleaving too, then unless your application explicitly binds memory to a remote socket, you cannot interleave memory. Or you would need to use NUMA tools (to set memory policy etc.) while launching your application.

Bandwidth or latency monitoring is also dependent on the workload you are running. If the workload a) runs atomics and b) is running on Node-0 only, but with memory across both nodes, then the snoop responses are going to take longer, because socket-1 might be in power-saving states (uncore frequency etc.). So ideally you would need a 'snoozer' thread on the remote socket (Node-1) which would prevent the socket from entering one of the 'C' (or whatever) states. (Or you can disable hardware power-saving modes, but then you may need to line up all the options, because the kernel may have power-saving options too.) If you use industry-standard tools like 'stream' (as others mentioned), they will do all of this for you (via dummy/snoozer threads and so on).

If you want to write this all by yourself then you should know the NUMA APIs (http://man7.org/linux/man-pages/man3/numa.3.html) and NUMA tools (numactl, taskset). For latency measurements you should also disable prefetching, or else it will give you super awesome numbers.

Note: in the future, if you use a device driver that does the allocation for your app, you should make sure the driver knows about NUMA allocation too and aligns everything for you. I found that out the hard way back in 2010, when I realized that the Linux kernel had no guidance for PCIe drivers (back then ... it's OK now). I fixed it locally at the time.

Hope this helps.
Chetan Loke

On Monday, January 15, 2018 at 11:19:53 AM UTC-5, Kevin Bowling wrote:
> lmbench works well: http://www.bitmover.com/lmbench/man_lmbench.html, and Larry seems happy to answer questions on building/using it.
>
> Unless you've explicitly built an application to work with NUMA, or are able to run two copies of an application pinned to each domain, you really will only get about one package's worth of BW, and latency is a bigger deal (which lmbench can also measure in cooperation with numactl).
>
> Regards,
>
> On Sun, Jan 14, 2018 at 11:44 AM, Peter Veentjer wrote:
> > I'm working on some very simple aggregations on huge chunks of off-heap memory (500GB+) for a hackathon. This is done using a very simple stride; every iteration the address increases by 20 bytes. So the prefetcher should not have any problems with it.
> >
> > According to my calculations I'm currently processing 35 GB/s. However I'm not sure if I'm close to the maximum bandwidth of this machine. Specs:
> > 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P
> > 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
> >
> > What is the best tool to determine the maximum bandwidth of a machine running Linux (RHEL 7)?
Re: Mutex and Spectre/Meltdown
Hi,

I saw a considerable slowdown with a microbenchmark on Linux systems with KPTI enabled vs. disabled. The benchmark performs no I/O, but it is very futex-syscall-heavy. Unfortunately I did not have time to investigate further, but I also suspect that the page-table switching overhead when performing these syscalls is to blame. I would also be interested in more input on this.

Regards,
Marvin

On Monday, January 15, 2018 at 6:40:07 PM UTC+1, Francesco Nigro wrote:
> Hi guys!
>
> Any of you have already measured (or simply know) if OS mutexes are somehow affected by Spectre/Meltdown?
>
> Cheers,
> Franz