Re: [OMPI users] shared memory performance
Hello Cristian,

TAU is still under active development, and the developers respond fairly quickly to emails. The latest version, 2.24.1, came out just two months ago. Check out https://www.cs.uoregon.edu/research/tau/home.php for more information. If you are running into issues getting the latest version of TAU to work with Open MPI 1.8.x, check out the "Contact" page from the above URL. The developers are very helpful.

Thanks,
David

On 07/22/2015 02:42 AM, Crisitan RUIZ wrote:

Thank you for your answer, Harald.

Actually, I was already using TAU, but it seems that it is not maintained any more and there are problems when instrumenting applications with version 1.8.5 of Open MPI. I was using Open MPI 1.6.5 before to test the execution of HPC applications on Linux containers. I tested the performance of the NAS benchmarks in three different configurations:
- 8 Linux containers on the same machine, each configured with 2 cores
- 8 physical machines
- 1 physical machine

As I already described, each machine has 2 processors (8 cores each). I instrumented and ran all the NAS benchmarks in these three configurations and attached the results to this email. In the table, "native" corresponds to using 8 physical machines and "SM" corresponds to 1 physical machine using the sm module; times are given in milliseconds.

What surprises me in the results is that, even in the worst case, the communication time with containers equals that of plain MPI processes, even though the containers communicate through virtual network interfaces. This behaviour is probably due to process binding and locality. I wanted to redo the test using Open MPI 1.8.5, but unfortunately I couldn't successfully instrument the applications. I looked for another MPI profiler but couldn't find one: HPCToolkit looks like it is not maintained any more, and Vampir no longer maintains the tool that instruments applications. I will probably give Paraver a try.

Best regards,
Cristian Ruiz

On 07/22/2015 09:44 AM, Harald Servat wrote:

Cristian, you might observe super-speedup here because on 8 nodes you have 8 times the cache that you have on only 1 node. You can also validate that by checking cache miss activity using the tools that I mentioned in my other email. Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on 8 machines; I have to get rid of the "-np 8". So I corrected the command and used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

and got these results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 96094.35

Why is the multi-node performance almost 4 times better than multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores on the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
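If the current TAU release does work with Open MPI 1.8.x, as suggested above, runtime instrumentation is probably the least intrusive way to redo the measurements, since it avoids recompiling the benchmarks. A minimal sketch, assuming a TAU 2.24.x build configured with MPI support, tau_exec and pprof in the PATH, and a dynamically linked mg.C.8 binary (all of these are assumptions, not something verified in this thread):

# run the uninstrumented NAS binary under TAU's runtime MPI interposition
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp \
    tau_exec ./mg.C.8
# summarize the per-rank profiles written to the working directory
pprof

Alternatively, rebuilding the benchmark with TAU's compiler wrappers (e.g. tau_f90.sh with a suitable TAU_MAKEFILE) gives source-level instrumentation at the cost of a recompile.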
Re: [OMPI users] shared memory performance
Hi Cristian, list,

I haven't been following the shared-memory details of OMPI lately, but my recollection from some time ago is that in the 1.8 series the default (and recommended) shared-memory transport BTL switched from "sm" to "vader", which is the latest and greatest. In that case, I guess the mpirun options would be:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,vader,tcp

I am not even sure whether the "self" BTL is still needed with "vader", as it was with "sm". An OMPI developer could jump into this conversation and clarify. Thank you.

I hope this helps,
Gus Correa

On 07/22/2015 04:42 AM, Crisitan RUIZ wrote:

Thank you for your answer, Harald.

Actually, I was already using TAU, but it seems that it is not maintained any more and there are problems when instrumenting applications with version 1.8.5 of Open MPI. I was using Open MPI 1.6.5 before to test the execution of HPC applications on Linux containers. I tested the performance of the NAS benchmarks in three different configurations:
- 8 Linux containers on the same machine, each configured with 2 cores
- 8 physical machines
- 1 physical machine

As I already described, each machine has 2 processors (8 cores each). I instrumented and ran all the NAS benchmarks in these three configurations and attached the results to this email. In the table, "native" corresponds to using 8 physical machines and "SM" corresponds to 1 physical machine using the sm module; times are given in milliseconds.

What surprises me in the results is that, even in the worst case, the communication time with containers equals that of plain MPI processes, even though the containers communicate through virtual network interfaces. This behaviour is probably due to process binding and locality. I wanted to redo the test using Open MPI 1.8.5, but unfortunately I couldn't successfully instrument the applications. I looked for another MPI profiler but couldn't find one: HPCToolkit looks like it is not maintained any more, and Vampir no longer maintains the tool that instruments applications. I will probably give Paraver a try.

Best regards,
Cristian Ruiz

On 07/22/2015 09:44 AM, Harald Servat wrote:

Cristian, you might observe super-speedup here because on 8 nodes you have 8 times the cache that you have on only 1 node. You can also validate that by checking cache miss activity using the tools that I mentioned in my other email. Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on 8 machines; I have to get rid of the "-np 8". So I corrected the command and used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

and got these results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 96094.35

Why is the multi-node performance almost 4 times better than multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores on the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
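Before switching the command line over, it may be worth confirming that the vader BTL suggested above is actually present in the local 1.8.x build. A quick check (a sketch; the exact output format varies between Open MPI versions):

ompi_info | grep btl          # lists the BTL components compiled into this installation
ompi_info --param btl vader   # shows vader's MCA parameters, if the component exists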
Re: [OMPI users] shared memory performance
Thank you for your answer, Harald.

Actually, I was already using TAU, but it seems that it is not maintained any more and there are problems when instrumenting applications with version 1.8.5 of Open MPI. I was using Open MPI 1.6.5 before to test the execution of HPC applications on Linux containers. I tested the performance of the NAS benchmarks in three different configurations:
- 8 Linux containers on the same machine, each configured with 2 cores
- 8 physical machines
- 1 physical machine

As I already described, each machine has 2 processors (8 cores each). I instrumented and ran all the NAS benchmarks in these three configurations and attached the results to this email. In the table, "native" corresponds to using 8 physical machines and "SM" corresponds to 1 physical machine using the sm module; times are given in milliseconds.

What surprises me in the results is that, even in the worst case, the communication time with containers equals that of plain MPI processes, even though the containers communicate through virtual network interfaces. This behaviour is probably due to process binding and locality. I wanted to redo the test using Open MPI 1.8.5, but unfortunately I couldn't successfully instrument the applications. I looked for another MPI profiler but couldn't find one: HPCToolkit looks like it is not maintained any more, and Vampir no longer maintains the tool that instruments applications. I will probably give Paraver a try.

Best regards,
Cristian Ruiz

On 07/22/2015 09:44 AM, Harald Servat wrote:

Cristian, you might observe super-speedup here because on 8 nodes you have 8 times the cache that you have on only 1 node. You can also validate that by checking cache miss activity using the tools that I mentioned in my other email. Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on 8 machines; I have to get rid of the "-np 8". So I corrected the command and used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

and got these results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 96094.35

Why is the multi-node performance almost 4 times better than multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores on the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
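One way to test the binding/locality hypothesis raised above is to have mpirun report where each rank actually lands, both for the container runs and for the native ones. A minimal sketch, assuming Open MPI 1.8.x and reusing the binary from the runs above:

# print one line per rank showing the socket and cores it is bound to
mpirun -np 8 --report-bindings --bind-to core --allow-run-as-root mg.C.8

The binding report in the job output makes it easy to see whether ranks that should be on separate cores end up sharing a socket or a core.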
Re: [OMPI users] shared memory performance
Cristian,

one explanation could be that the benchmark is memory-bound, so running on more sockets means higher aggregate memory bandwidth and therefore better performance. Another explanation is that on one node you are running one OpenMP thread per MPI task, while on 8 nodes you are running 8 OpenMP threads per task.

Cheers,
Gilles

On 7/22/2015 4:42 PM, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on 8 machines; I have to get rid of the "-np 8". So I corrected the command and used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

and got these results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 96094.35

Why is the multi-node performance almost 4 times better than multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores on the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
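If the binaries were built from a hybrid MPI+OpenMP variant of the NAS benchmarks, pinning the thread count would rule that second explanation in or out. A sketch, assuming a bash shell and that the code honours OMP_NUM_THREADS (both are assumptions; the plain MPI NPB builds simply ignore the variable):

export OMP_NUM_THREADS=1
# -x forwards the environment variable to every rank, including the remote ones
mpirun -np 8 -x OMP_NUM_THREADS --machinefile machine_file.txt \
    --mca btl self,sm,tcp --allow-run-as-root mg.C.8

If the Mop/s figures do not change with the thread count forced to 1, the gap is not coming from extra OpenMP threads.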
Re: [OMPI users] shared memory performance
Cristian,

you might observe super-speedup here because on 8 nodes you have 8 times the cache that you have on only 1 node. You can also validate that by checking cache miss activity using the tools that I mentioned in my other email.

Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on 8 machines; I have to get rid of the "-np 8". So I corrected the command and used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

and got these results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 96094.35

Why is the multi-node performance almost 4 times better than multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
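For checking cache-miss activity without a full tracing tool, the hardware counters can also be read per rank with Linux perf. A rough sketch, assuming perf is installed on the nodes and the user is allowed to read the counters; the wrapper script name and output file naming are made up for the example:

#!/bin/bash
# perf_wrap.sh -- run the rank's command under perf stat and keep one summary per rank
# (OMPI_COMM_WORLD_RANK is set by Open MPI for every launched process)
exec perf stat -e cache-references,cache-misses \
     -o perf.rank$OMPI_COMM_WORLD_RANK.txt "$@"

Launched as, for example:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root ./perf_wrap.sh ./mg.C.8

Comparing the per-rank miss rates of the single-node run against the 8-machine run would show directly whether the extra aggregate cache is where the 4x difference comes from.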
Re: [OMPI users] shared memory performance
Sorry, I've just discovered that I was using the wrong command to run on 8 machines; I have to get rid of the "-np 8". So I corrected the command and used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

and got these results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 96094.35

Why is the multi-node performance almost 4 times better than multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores on the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
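For reference, when -np is omitted, mpirun starts one process per slot listed in the machinefile, so the contents of machine_mpi_bug.txt determine how many ranks run and where. A hypothetical example of such a file (the real host names were not shown in this thread):

node01 slots=1
node02 slots=1
node03 slots=1
# ... one line per host, eight lines in total
node08 slots=1

With one slot per host for 8 hosts this launches 8 ranks, one per machine; if a host were listed with, say, slots=16, mpirun would start 16 ranks on that single machine instead.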
Re: [OMPI users] shared memory performance
Dear Cristian,

as you probably know, class C is one of the larger problem classes for the NAS benchmarks. That likely means the application spends much more time on the actual computation than on communication, which could explain why you see so little difference between the two executions: the communication is small compared with the rest. To validate this reasoning, you can profile or trace the application using some of the performance tools available out there (Vampir, Paraver, TAU, HPCToolkit, Scalasca, ...) and see how the communication compares to the computation.

Regards.

On 22/07/15 09:19, Crisitan RUIZ wrote:

Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores on the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
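One concrete way to get that communication/computation split is runtime interposition with a tracing library plus a timeline viewer such as Paraver. A rough sketch using Extrae, with hypothetical install paths (the library name below assumes a Fortran binary, which is what the NAS benchmarks are; none of these paths come from this thread):

#!/bin/bash
# trace.sh -- preload Extrae's MPI tracing library for each rank, then run the real command
export EXTRAE_HOME=/opt/extrae            # hypothetical install prefix
export EXTRAE_CONFIG_FILE=./extrae.xml    # tracing options: counters, buffer sizes, ...
export LD_PRELOAD=$EXTRAE_HOME/lib/libmpitracef.so
exec "$@"

and then launch with, e.g., mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root ./trace.sh ./mg.C.8. The resulting trace can be merged with mpi2prv and opened in Paraver to compare time spent inside MPI against time spent in user code.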
[OMPI users] shared memory performance
Hello,

I'm running Open MPI 1.8.5 on a cluster with the following characteristics: each node is equipped with two Intel Xeon E5-2630 v3 processors (8 cores each), 128 GB of RAM, and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores on the same machine, I get almost the same performance as using 8 machines running one MPI process per machine. I used the following commands:

for running multi-node:
mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

for running with 8 cores:
mpirun -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8

I got the following results:

8 cores: Mop/s total = 19368.43
8 machines: Mop/s total = 19326.60

The results are similar for other benchmarks. Is this behavior normal? I was expecting to see better behavior using 8 cores, given that they communicate directly through memory.

Thank you,
Cristian
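A follow-up experiment that helps interpret the single-node number is to control how the 8 ranks are spread over the two sockets. A sketch, assuming Open MPI 1.8.5's --map-by and --bind-to options (the choice of flags here is a suggestion, not something taken from the original runs):

# pack the 8 ranks onto consecutive cores (mostly one socket)
mpirun -np 8 --map-by core --bind-to core --mca btl self,sm,tcp --allow-run-as-root mg.C.8
# alternate the ranks across the two sockets, doubling the available memory bandwidth
mpirun -np 8 --map-by socket --bind-to core --mca btl self,sm,tcp --allow-run-as-root mg.C.8

If the second placement is noticeably faster, the benchmark is limited by memory bandwidth rather than by the speed of the shared-memory transport, which would also mean that 8 separate machines lose little by communicating over TCP.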