Re: [OMPI users] shared memory performance

2015-07-24 Thread Gilles Gouaillardet
Cristian,

one more thing...
two containers on the same host cannot communicate with the sm btl.
You might want to run mpirun with --mca btl tcp,self on one physical machine,
without containers, in order to assess the performance degradation due to
the tcp btl alone, without any containerization effect.
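For instance, reusing the command line from earlier in this thread (a sketch only; mg.C.8 and --allow-run-as-root are taken from Cristian's own runs):

mpirun -np 8 --mca btl tcp,self --allow-run-as-root mg.C.8

Comparing that against the same run with --mca btl sm,self on the same node should show how much of the container slowdown is simply the cost of the tcp btl.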

Cheers,

Gilles

On Friday, July 24, 2015, Harald Servat  wrote:

> Dear Cristian,
>
>   according to your configuration:
>
>   a) - 8 Linux containers on the same machine configured with 2 cores
>   b) - 8 physical machines
>   c) - 1 physical machine
>
>   a) and c) have exactly the same physical computational resources, despite
> the fact that a) is virtualized and the processors are oversubscribed
> (2 virtual cores per physical core). I'm not an expert on virtualization,
> but since a) and c) are exactly the same hardware (in terms of cores and
> memory hierarchy), and your application seems memory bound, I'd expect to
> see what you tabulated, and b) to be faster because you have 8 times the
> cache memory.
>
> Regards
> PS Your name in the mail is different, maybe you'd like to fix it.
>
> On 22/07/15 10:42, Crisitan RUIZ wrote:
>
>> Thank you for your answer Harald
>>
>> Actually, I was already using TAU, but it seems that it is no longer
>> maintained and there are problems when instrumenting applications with
>> version 1.8.5 of Open MPI.
>>
>> I was previously using Open MPI 1.6.5 to test the execution of HPC
>> applications on Linux containers. I tested the performance of the NAS
>> benchmarks in three different configurations:
>>
>> - 8 Linux containers on the same machine, each configured with 2 cores
>> - 8 physical machines
>> - 1 physical machine
>>
>> As I already described, each machine has 2 processors (8 cores each). I
>> instrumented and ran all the NAS benchmarks in these three configurations
>> and attached the results to this email. In the table, "native" corresponds
>> to using 8 physical machines and "SM" to 1 physical machine using the sm
>> module; times are given in milliseconds.
>>
>> What surprises me in the results is that, in the worst case, the containers
>> show communication times equal to those of plain MPI processes, even though
>> the containers use virtual network interfaces to communicate with each
>> other. This behaviour is probably due to process binding and locality. I
>> wanted to redo the test with Open MPI 1.8.5, but unfortunately I couldn't
>> successfully instrument the applications. I looked for another MPI profiler
>> but couldn't find one: HPCToolkit looks like it is no longer maintained, and
>> Vampir no longer maintains the tool that instruments applications. I will
>> probably give Paraver a try.
>>
>>
>>
>> Best regards,
>>
>> Cristian Ruiz
>>
>>
>>
>> On 07/22/2015 09:44 AM, Harald Servat wrote:
>>
>>>
>>> Cristian,
>>>
>>>   you might observe super-speedup here because in 8 nodes you have 8
>>> times the cache you have in only 1 node. You can also validate that by
>>> checking for cache miss activity using the tools that I mentioned in
>>> my other email.
>>>
>>> Best regards.
>>>
>>> On 22/07/15 09:42, Crisitan RUIZ wrote:
>>>
 Sorry, I've just discovered that I was using the wrong command to run on
 8 machines. I have to get rid of the "-np 8"

 So, I corrected the command and I used:

 mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
 --allow-run-as-root mg.C.8

 And got these results

 8 cores:

 Mop/s total = 19368.43


 8 machines

 Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
 multi-core? Is this normal behavior?

 On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

>
>  Hello,
>
> I'm running OpenMPI 1.8.5 on a cluster with the following
> characteristics:
>
> Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
> cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>
> When I run the NAS benchmarks using 8 cores in the same machine, I'm
> getting almost the same performance as using 8 machines running an MPI
> process per machine.
>
> I used the following commands:
>
> for running multi-node:
>
> mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
> --allow-run-as-root mg.C.8
>
> for running with 8 cores:
>
> mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8
>
>
> I got the following results:
>
> 8 cores:
>
>  Mop/s total = 19368.43
>
> 8 machines:
>
> Mop/s total = 19326.60
>
>
> The results are similar for other benchmarks. Is this behavior normal?
> I was expecting to see better behavior using 8 cores given that they
> communicate directly through memory.

Re: [OMPI users] shared memory performance

2015-07-24 Thread Harald Servat

Dear Cristian,

  according to your configuration:

  a) - 8 Linux containers on the same machine configured with 2 cores
  b) - 8 physical machines
  c) - 1 physical machine

  a) and c) have exactly the same physical computational resources, despite
the fact that a) is virtualized and the processors are oversubscribed
(2 virtual cores per physical core). I'm not an expert on virtualization,
but since a) and c) are exactly the same hardware (in terms of cores and
memory hierarchy), and your application seems memory bound, I'd expect to
see what you tabulated, and b) to be faster because you have 8 times the
cache memory.


Regards
PS Your name in the mail is different, maybe you'd like to fix it.

On 22/07/15 10:42, Crisitan RUIZ wrote:

Thank you for your answer Harald

Actually, I was already using TAU, but it seems that it is no longer
maintained and there are problems when instrumenting applications with
version 1.8.5 of Open MPI.

I was previously using Open MPI 1.6.5 to test the execution of HPC
applications on Linux containers. I tested the performance of the NAS
benchmarks in three different configurations:

- 8 Linux containers on the same machine, each configured with 2 cores
- 8 physical machines
- 1 physical machine

As I already described, each machine has 2 processors (8 cores each). I
instrumented and ran all the NAS benchmarks in these three configurations
and attached the results to this email. In the table, "native" corresponds
to using 8 physical machines and "SM" to 1 physical machine using the sm
module; times are given in milliseconds.

What surprises me in the results is that, in the worst case, the containers
show communication times equal to those of plain MPI processes, even though
the containers use virtual network interfaces to communicate with each
other. This behaviour is probably due to process binding and locality. I
wanted to redo the test with Open MPI 1.8.5, but unfortunately I couldn't
successfully instrument the applications. I looked for another MPI profiler
but couldn't find one: HPCToolkit looks like it is no longer maintained, and
Vampir no longer maintains the tool that instruments applications. I will
probably give Paraver a try.



Best regards,

Cristian Ruiz



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup here because in 8 nodes you have 8
times the cache you have in only 1 node. You can also validate that by
checking for cache miss activity using the tools that I mentioned in
my other email.

Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running an MPI
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see better behavior using 8 cores given that they
communicate directly through memory.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread David Shrader

Hello Cristian,

TAU is still under active development and the developers respond fairly 
fast to emails. The latest version, 2.24.1, came out just two months 
ago. Check out https://www.cs.uoregon.edu/research/tau/home.php for more 
information.


If you are running into issues getting the latest version of TAU to 
work with Open MPI 1.8.x, check out the "Contact" page from the above 
URL. The developers are very helpful.
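For what it's worth, a minimal way to try TAU without recompiling the benchmarks (a sketch; it assumes a TAU installation built with MPI support and that tau_exec is in the PATH):

mpirun -np 8 tau_exec mg.C.8

tau_exec intercepts the MPI calls at run time, so it can be a fallback when source instrumentation of the NAS benchmarks fails with Open MPI 1.8.x.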


Thanks,
David

On 07/22/2015 02:42 AM, Crisitan RUIZ wrote:

Thank you for your answer Harald

Actually, I was already using TAU, but it seems that it is no longer
maintained and there are problems when instrumenting applications with
version 1.8.5 of Open MPI.

I was previously using Open MPI 1.6.5 to test the execution of HPC
applications on Linux containers. I tested the performance of the NAS
benchmarks in three different configurations:

- 8 Linux containers on the same machine, each configured with 2 cores
- 8 physical machines
- 1 physical machine

As I already described, each machine has 2 processors (8 cores each). I
instrumented and ran all the NAS benchmarks in these three configurations
and attached the results to this email. In the table, "native" corresponds
to using 8 physical machines and "SM" to 1 physical machine using the sm
module; times are given in milliseconds.

What surprises me in the results is that, in the worst case, the containers
show communication times equal to those of plain MPI processes, even though
the containers use virtual network interfaces to communicate with each
other. This behaviour is probably due to process binding and locality. I
wanted to redo the test with Open MPI 1.8.5, but unfortunately I couldn't
successfully instrument the applications. I looked for another MPI profiler
but couldn't find one: HPCToolkit looks like it is no longer maintained, and
Vampir no longer maintains the tool that instruments applications. I will
probably give Paraver a try.




Best regards,

Cristian Ruiz



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup here because in 8 nodes you have 8
times the cache you have in only 1 node. You can also validate that 
by checking for cache miss activity using the tools that I mentioned 
in my other email.


Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:
Sorry, I've just discovered that I was using the wrong command to 
run on

8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running an MPI
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see better behavior using 8 cores given that they
communicate directly through memory.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php





Re: [OMPI users] shared memory performance

2015-07-22 Thread Gus Correa

Hi Christian, list

I haven't been following the shared memory details of OMPI lately,
but my recollection from some time ago is that in the 1.8 series the
default (and recommended) shared memory transport btl switched from
"sm" to "vader", which is the latest greatest.

In this case, I guess the mpirun options would be:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,vader,tcp


I am not even sure if with "vader" the "self" btl is needed,
as it was the case with "sm".
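One way to see which btl actually gets picked (a sketch; btl_base_verbose is the framework verbosity parameter that shows up later in this thread, and the hostfile/binary are the ones Cristian already uses):

mpirun --machinefile machine_mpi_bug.txt --mca btl self,vader,tcp --mca btl_base_verbose 30 --allow-run-as-root mg.C.8

The extra verbosity makes Open MPI log which BTL components it opens and selects, which should confirm whether vader (or sm) is handling the on-node pairs.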

An OMPI developer could jump into this conversation and clarify.
Thank you.

I hope this helps,
Gus Correa


On 07/22/2015 04:42 AM, Crisitan RUIZ wrote:

Thank you for your answer Harald

Actually, I was already using TAU, but it seems that it is no longer
maintained and there are problems when instrumenting applications with
version 1.8.5 of Open MPI.

I was previously using Open MPI 1.6.5 to test the execution of HPC
applications on Linux containers. I tested the performance of the NAS
benchmarks in three different configurations:

- 8 Linux containers on the same machine, each configured with 2 cores
- 8 physical machines
- 1 physical machine

As I already described, each machine has 2 processors (8 cores each). I
instrumented and ran all the NAS benchmarks in these three configurations
and attached the results to this email. In the table, "native" corresponds
to using 8 physical machines and "SM" to 1 physical machine using the sm
module; times are given in milliseconds.

What surprises me in the results is that, in the worst case, the containers
show communication times equal to those of plain MPI processes, even though
the containers use virtual network interfaces to communicate with each
other. This behaviour is probably due to process binding and locality. I
wanted to redo the test with Open MPI 1.8.5, but unfortunately I couldn't
successfully instrument the applications. I looked for another MPI profiler
but couldn't find one: HPCToolkit looks like it is no longer maintained, and
Vampir no longer maintains the tool that instruments applications. I will
probably give Paraver a try.



Best regards,

Cristian Ruiz



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup here because in 8 nodes you have 8
times the cache you have in only 1 node. You can also validate that by
checking for cache miss activity using the tools that I mentioned in
my other email.

Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running an MPI
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see better behavior using 8 cores given that they
communicate directly through memory.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread Crisitan RUIZ

Thank you for your answer Harald

Actually, I was already using TAU, but it seems that it is no longer
maintained and there are problems when instrumenting applications with
version 1.8.5 of Open MPI.

I was previously using Open MPI 1.6.5 to test the execution of HPC
applications on Linux containers. I tested the performance of the NAS
benchmarks in three different configurations:

- 8 Linux containers on the same machine, each configured with 2 cores
- 8 physical machines
- 1 physical machine

As I already described, each machine has 2 processors (8 cores each). I
instrumented and ran all the NAS benchmarks in these three configurations
and attached the results to this email. In the table, "native" corresponds
to using 8 physical machines and "SM" to 1 physical machine using the sm
module; times are given in milliseconds.

What surprises me in the results is that, in the worst case, the containers
show communication times equal to those of plain MPI processes, even though
the containers use virtual network interfaces to communicate with each
other. This behaviour is probably due to process binding and locality. I
wanted to redo the test with Open MPI 1.8.5, but unfortunately I couldn't
successfully instrument the applications. I looked for another MPI profiler
but couldn't find one: HPCToolkit looks like it is no longer maintained, and
Vampir no longer maintains the tool that instruments applications. I will
probably give Paraver a try.
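Since process binding and locality were mentioned as a possible explanation, one quick check (a sketch; option names as in Open MPI 1.8, with the binary and flags from this thread) is to have mpirun print where each rank lands and to bind ranks explicitly:

mpirun -np 8 --map-by socket --bind-to core --report-bindings --mca btl self,sm,tcp --allow-run-as-root mg.C.8

--report-bindings shows the core/socket assigned to each rank, which makes it easier to tell whether the single-node and container runs are really comparable.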




Best regards,

Cristian Ruiz



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup here because in 8 nodes you have 8
times the cache you have in only 1 node. You can also validate that by 
checking for cache miss activity using the tools that I mentioned in 
my other email.


Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running an MPI
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see better behavior using 8 cores given that they
communicate directly through memory.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27298.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread Gilles Gouaillardet

Christian,

one explanation could be that the benchmark is memory bound, so running
on more sockets means more memory bandwidth and hence better performance.


another explanation is that on one node you are running one OpenMP
thread per MPI task, while on 8 nodes you are running 8 OpenMP threads
per task.
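If the OpenMP thread count is the suspect, it can be pinned explicitly for both runs (a sketch; -x exports an environment variable to all ranks, and the binary/flags are the ones from this thread):

mpirun -x OMP_NUM_THREADS=1 --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp --allow-run-as-root mg.C.8

Running the single-node and multi-node cases with the same OMP_NUM_THREADS removes that variable from the comparison.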


Cheers,

Gilles

On 7/22/2015 4:42 PM, Crisitan RUIZ wrote:
Sorry, I've just discovered that I was using the wrong command to run 
on 8 machines. I have to get rid of the "-np 8"


So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
multi-core? Is this normal behavior?


On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following 
characteristics:


Each node is equipped with two Intel Xeon E5-2630v3 processors (with 
8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.


When I run the NAS benchmarks using 8 cores in the same machine, I'm 
getting almost the same performance as using 8 machines running an MPI
process per machine.


I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior 
normal? I was expecting to see better behavior using 8 cores given
that they communicate directly through memory.


Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27297.php







Re: [OMPI users] shared memory performance

2015-07-22 Thread Harald Servat


Cristian,

  you might observe super-speedup here because in 8 nodes you have 8
times the cache you have in only 1 node. You can also validate that by 
checking for cache miss activity using the tools that I mentioned in my 
other email.
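A quick, tool-agnostic way to check the cache-miss hypothesis (a sketch; perf is the standard Linux profiler rather than one of the tracing tools mentioned above, and the event names may vary by kernel/CPU):

mpirun -np 8 --mca btl self,sm --allow-run-as-root perf stat -e cache-references,cache-misses mg.C.8

Each rank prints its own cache-reference/miss counters when it exits, so the single-node and 8-node runs can be compared directly.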


Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running an MPI
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see better behavior using 8 cores given that they
communicate directly through memory.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php





Re: [OMPI users] shared memory performance

2015-07-22 Thread Crisitan RUIZ
Sorry, I've just discovered that I was using the wrong command to run on 
8 machines. I have to get rid of the "-np 8"


So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than
multi-core? Is this normal behavior?


On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following 
characteristics:


Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8 
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.


When I run the NAS benchmarks using 8 cores in the same machine, I'm 
getting almost the same performance as using 8 machines running an MPI
process per machine.


I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal? 
I was expecting to see better behavior using 8 cores given that they
communicate directly through memory.


Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27295.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread Harald Servat


Dear Cristian,

  as you probably know, class C is one of the large classes for the NAS
benchmarks. That likely means the application spends much more time on
actual computation than on communication, which could explain why you see
so little difference between the two executions: communication is small
compared with the rest.


  In order to validate this reasoning, you can profile or trace the
application using some of the performance tools available out there
(Vampir, Paraver, TAU, HPCToolkit, Scalasca, ...) and see how much time is
spent in communication compared to computation.
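One lightweight option for exactly this communication-vs-computation split is mpiP, which is what gets used later in the older thread below (a sketch; the library path is an assumption for your installation):

mpirun -np 8 -x LD_PRELOAD=/opt/mpiP/lib/libmpiP.so --allow-run-as-root mg.C.8

mpiP's summary report breaks down AppTime versus MPITime per rank, which is enough to tell whether MG class C is dominated by computation.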


Regards.

On 22/07/15 09:19, Crisitan RUIZ wrote:


  Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running an MPI
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

  Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal? I
was expecting to see better behavior using 8 cores given that they
communicate directly through memory.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php





[OMPI users] shared memory performance

2015-07-22 Thread Crisitan RUIZ


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8 
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.


When I run the NAS benchmarks using 8 cores in the same machine, I'm 
getting almost the same performance as using 8 machines running an MPI
process per machine.


I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal? I 
was expecting to see better behavior using 8 cores given that they
communicate directly through memory.


Thank you,

Cristian


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Tim Prince

On 3/30/2011 10:08 AM, Eugene Loh wrote:

Michele Marena wrote:

I've launched my app with mpiP both when the two processes are on
different nodes and when the two processes are on the same node.

The process 0 is the manager (gathers the results only), processes 1
and 2 are workers (compute).

This is the case processes 1 and 2 are on different nodes (runs in 162s).
@--- MPI Time (seconds) ---
Task AppTime MPITime MPI%
0 162 162 99.99
1 162 30.2 18.66
2 162 14.7 9.04
* 486 207 42.56

The case when processes 1 and 2 are on the same node (runs in 260s).
@--- MPI Time (seconds) ---
Task AppTime MPITime MPI%
0 260 260 99.99
1 260 39.7 15.29
2 260 26.4 10.17
* 779 326 41.82

I think there's a contention problem on the memory bus.

Right. Process 0 spends all its time in MPI, presumably waiting on
workers. The workers spend about the same amount of time on MPI
regardless of whether they're placed together or not. The big difference
is that the workers are much slower in non-MPI tasks when they're
located on the same node. The issue has little to do with MPI. The
workers are hogging local resources and work faster when placed on
different nodes.

However, the message size is 4096 * sizeof(double). Maybe I am wrong
on this point. Is the message size too large for shared memory?

No. That's not very large at all.


Not even large enough for the non-temporal store / cache eviction issue
to arise.



--
Tim Prince


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Eugene Loh




Michele Marena wrote:

> I've launched my app with mpiP both when the two processes are on different
> nodes and when the two processes are on the same node.
>
> Process 0 is the manager (gathers the results only); processes 1 and 2
> are workers (compute).
>
> This is the case where processes 1 and 2 are on different nodes (runs in 162s).
>
> @--- MPI Time (seconds) ---
> Task    AppTime    MPITime     MPI%
>    0        162        162    99.99
>    1        162       30.2    18.66
>    2        162       14.7     9.04
>    *        486        207    42.56
>
> The case when processes 1 and 2 are on the same node (runs in 260s).
>
> @--- MPI Time (seconds) ---
> Task    AppTime    MPITime     MPI%
>    0        260        260    99.99
>    1        260       39.7    15.29
>    2        260       26.4    10.17
>    *        779        326    41.82
>
> I think there's a contention problem on the memory bus.

Right. Process 0 spends all its time in MPI, presumably waiting on
workers. The workers spend about the same amount of time on MPI
regardless of whether they're placed together or not. The big
difference is that the workers are much slower in non-MPI tasks when
they're located on the same node. The issue has little to do with
MPI. The workers are hogging local resources and work faster when
placed on different nodes.

> However, the message size is 4096 * sizeof(double). Maybe I am wrong on
> this point. Is the message size too large for shared memory?

No. That's not very large at all.

>>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>>
>>> > http://www.open-mpi.org/faq/?category=perftools

  
  
  





Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Michele Marena
Hi Jeff,
I thank you for your help,
I've launched my app with mpiP both when the two processes are on different
nodes and when the two processes are on the same node.

Process 0 is the manager (gathers the results only); processes 1 and 2
are workers (compute).

This is the case processes 1 and 2 are on different nodes (runs in 162s).

---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        162        162    99.99
   1        162       30.2    18.66
   2        162       14.7     9.04
   *        486        207    42.56
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call     Site       Time    App%    MPI%     COV
Barrier     5   1.28e+05   26.24   61.64    0.00
Barrier    14    2.3e+04    4.74   11.13    0.00
Barrier     6   2.29e+04    4.72   11.08    0.00
Barrier    17   1.77e+04    3.65    8.58    1.41
Recv        3   1.15e+04    2.37    5.58    0.00
Recv       30   2.26e+03    0.47    1.09    0.00
Recv       12        308    0.06    0.15    0.00
Recv       26        286    0.06    0.14    0.00
Recv       28        252    0.05    0.12    0.00
Recv       31        246    0.05    0.12    0.00
Isend      35        111    0.02    0.05    0.00
Isend      34        108    0.02    0.05    0.00
Isend      18        107    0.02    0.05    0.00
Isend      19        105    0.02    0.05    0.00
Isend       9       57.7    0.01    0.03    0.05
Isend      32       39.7    0.01    0.02    0.00
Barrier    25       38.7    0.01    0.02    1.39
Isend      11       38.6    0.01    0.02    0.00
Recv       16       34.1    0.01    0.02    0.00
Recv       27       26.5    0.01    0.01    0.00
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call     Site    Count      Total       Avrg  Sent%
Isend       9     4096   1.34e+08   3.28e+04  58.73
Isend      34     1200   1.85e+07   1.54e+04   8.07
Isend      35     1200   1.85e+07   1.54e+04   8.07
Isend      18     1200   1.85e+07   1.54e+04   8.07
Isend      19     1200   1.85e+07   1.54e+04   8.07
Isend      32      240   3.69e+06   1.54e+04   1.61
Isend      11      240   3.69e+06   1.54e+04   1.61
Isend      15      180   3.44e+06   1.91e+04   1.51
Isend      33       61      2e+06   3.28e+04   0.87
Isend      10       61      2e+06   3.28e+04   0.87
Isend      29       61      2e+06   3.28e+04   0.87
Isend      22       61      2e+06   3.28e+04   0.87
Isend      37      180   1.72e+06   9.57e+03   0.75
Isend      24        2         16          8   0.00
Isend      20        2         16          8   0.00
Send        8        1          4          4   0.00
Send        1        1          4          4   0.00

The case when processes 1 and 2 are on the same node (runs in 260s).
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        260        260    99.99
   1        260       39.7    15.29
   2        260       26.4    10.17
   *        779        326    41.82

---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call     Site       Time    App%    MPI%     COV
Barrier     5   2.23e+05   28.64   68.50    0.00
Barrier     6   2.49e+04    3.20    7.66    0.00
Barrier    14   2.31e+04    2.96    7.09    0.00
Recv       28   1.35e+04    1.73    4.14    0.00
Recv       16   1.32e+04    1.70    4.06    0.00
Barrier    17   1.22e+04    1.56    3.74    1.41
Recv        3   1.16e+04    1.48    3.55    0.00
Recv       26   1.67e+03    0.21    0.51    0.00
Recv       30        940    0.12    0.29    0.00
Recv   

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Jeff Squyres
How many messages are you sending, and how large are they?  I.e., if your 
message passing is tiny, then the network transport may not be the bottleneck 
here.


On Mar 28, 2011, at 9:41 AM, Michele Marena wrote:

> I run ompi_info --param btl sm and this is the output
> 
>  MCA btl: parameter "btl_base_debug" (current value: "0")
>   If btl_base_debug is 1 standard debug is output, if 
> > 1 verbose debug is output
>  MCA btl: parameter "btl" (current value: )
>   Default selection set of components for the btl 
> framework ( means "use all components that can be found")
>  MCA btl: parameter "btl_base_verbose" (current value: "0")
>   Verbosity level for the btl framework (0 = no 
> verbosity)
>  MCA btl: parameter "btl_sm_free_list_num" (current value: 
> "8")
>  MCA btl: parameter "btl_sm_free_list_max" (current value: 
> "-1")
>  MCA btl: parameter "btl_sm_free_list_inc" (current value: 
> "64")
>  MCA btl: parameter "btl_sm_exclusivity" (current value: 
> "65535")
>  MCA btl: parameter "btl_sm_latency" (current value: "100")
>  MCA btl: parameter "btl_sm_max_procs" (current value: "-1")
>  MCA btl: parameter "btl_sm_sm_extra_procs" (current value: 
> "2")
>  MCA btl: parameter "btl_sm_mpool" (current value: "sm")
>  MCA btl: parameter "btl_sm_eager_limit" (current value: 
> "4096")
>  MCA btl: parameter "btl_sm_max_frag_size" (current value: 
> "32768")
>  MCA btl: parameter "btl_sm_size_of_cb_queue" (current value: 
> "128")
>  MCA btl: parameter "btl_sm_cb_lazy_free_freq" (current 
> value: "120")
>  MCA btl: parameter "btl_sm_priority" (current value: "0")
>  MCA btl: parameter "btl_base_warn_component_unused" (current 
> value: "1")
>   This parameter is used to turn on warning messages 
> when certain NICs are not used
> 
> 
> 2011/3/28 Ralph Castain 
> The fact that this exactly matches the time you measured with shared memory 
> is suspicious. My guess is that you aren't actually using shared memory at 
> all.
> 
> Does your "ompi_info" output show shared memory as being available? Jeff or 
> others may be able to give you some params that would let you check to see if 
> sm is actually being used between those procs.
> 
> 
> 
> On Mar 28, 2011, at 7:51 AM, Michele Marena wrote:
> 
>> What happens with 2 processes on the same node with tcp?
>> With --mca btl self,tcp my app runs in 23s.
>> 
>> 2011/3/28 Jeff Squyres (jsquyres) 
>> Ah, I didn't catch before that there were more variables than just tcp vs. 
>> shmem. 
>> 
>> What happens with 2 processes on the same node with tcp?
>> 
>> Eg, when both procs are on the same node, are you thrashing caches or memory?
>> 
>> Sent from my phone. No type good. 
>> 
>> On Mar 28, 2011, at 6:27 AM, "Michele Marena"  
>> wrote:
>> 
>>> However, I thank you Tim, Ralph and Jeff.
>>> My sequential application runs in 24s (wall clock time).
>>> My parallel application runs in 13s with two processes on different nodes.
>>> With shared memory, when two processes are on the same node, my app runs in 
>>> 23s.
>>> I don't understand why.
>>> 
>>> 2011/3/28 Jeff Squyres 
>>> If your program runs faster across 3 processes, 2 of which are local to 
>>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
>>> something is very, very strange.
>>> 
>>> Tim cites all kinds of things that can cause slowdowns, but it's still 
>>> very, very odd that simply enabling using the shared memory communications 
>>> channel in Open MPI *slows your overall application down*.
>>> 
>>> How much does your application slow down in wall clock time?  Seconds?  
>>> Minutes?  Hours?  (anything less than 1 second is in the noise)
>>> 
>>> 
>>> 
>>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>> 
>>> >
>>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>>> >
>>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>> >>> Hi,
>>> >>> My application performs well without shared memory utilization, but with
>>> >>> shared memory I get worse performance than without it.
>>> >>> Am I making a mistake? Is there something I'm not paying attention to?
>>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>>> >>> in the local filesystem.
>>> >>>
>>> >>
>>> >> I guess you mean shared memory message passing.   Among relevant 
>>> >> parameters may be the message size where your implementation switches 
>>> >> from cached copy to non-temporal (if you are on a platform where that 
>>> >> terminology is used).  If built with Intel compilers, for example, the 
>>> >> copy may be performed by intel_fast_memcpy, with a 

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
I run ompi_info --param btl sm and this is the output

 MCA btl: parameter "btl_base_debug" (current value: "0")
  If btl_base_debug is 1 standard debug is output,
if > 1 verbose debug is output
 MCA btl: parameter "btl" (current value: )
  Default selection set of components for the btl
framework ( means "use all components that can be found")
 MCA btl: parameter "btl_base_verbose" (current value: "0")
  Verbosity level for the btl framework (0 = no
verbosity)
 MCA btl: parameter "btl_sm_free_list_num" (current value:
"8")
 MCA btl: parameter "btl_sm_free_list_max" (current value:
"-1")
 MCA btl: parameter "btl_sm_free_list_inc" (current value:
"64")
 MCA btl: parameter "btl_sm_exclusivity" (current value:
"65535")
 MCA btl: parameter "btl_sm_latency" (current value: "100")
 MCA btl: parameter "btl_sm_max_procs" (current value: "-1")
 MCA btl: parameter "btl_sm_sm_extra_procs" (current value:
"2")
 MCA btl: parameter "btl_sm_mpool" (current value: "sm")
 MCA btl: parameter "btl_sm_eager_limit" (current value:
"4096")
 MCA btl: parameter "btl_sm_max_frag_size" (current value:
"32768")
 MCA btl: parameter "btl_sm_size_of_cb_queue" (current
value: "128")
 MCA btl: parameter "btl_sm_cb_lazy_free_freq" (current
value: "120")
 MCA btl: parameter "btl_sm_priority" (current value: "0")
 MCA btl: parameter "btl_base_warn_component_unused"
(current value: "1")
  This parameter is used to turn on warning messages
when certain NICs are not used


2011/3/28 Ralph Castain 

> The fact that this exactly matches the time you measured with shared memory
> is suspicious. My guess is that you aren't actually using shared memory at
> all.
>
> Does your "ompi_info" output show shared memory as being available? Jeff or
> others may be able to give you some params that would let you check to see
> if sm is actually being used between those procs.
>
>
>
> On Mar 28, 2011, at 7:51 AM, Michele Marena wrote:
>
> What happens with 2 processes on the same node with tcp?
> With --mca btl self,tcp my app runs in 23s.
>
> 2011/3/28 Jeff Squyres (jsquyres) 
>
>> Ah, I didn't catch before that there were more variables than just tcp vs.
>> shmem.
>>
>> What happens with 2 processes on the same node with tcp?
>>
>> Eg, when both procs are on the same node, are you thrashing caches or
>> memory?
>>
>> Sent from my phone. No type good.
>>
>> On Mar 28, 2011, at 6:27 AM, "Michele Marena" 
>> wrote:
>>
>> However, I thank you Tim, Ralph and Jeff.
>> My sequential application runs in 24s (wall clock time).
>> My parallel application runs in 13s with two processes on different nodes.
>> With shared memory, when two processes are on the same node, my app runs
>> in 23s.
>> I don't understand why.
>>
>> 2011/3/28 Jeff Squyres < jsquy...@cisco.com>
>>
>>> If your program runs faster across 3 processes, 2 of which are local to
>>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
>>> something is very, very strange.
>>>
>>> Tim cites all kinds of things that can cause slowdowns, but it's still
>>> very, very odd that simply enabling using the shared memory communications
>>> channel in Open MPI *slows your overall application down*.
>>>
>>> How much does your application slow down in wall clock time?  Seconds?
>>>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>>>
>>>
>>>
>>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>>
>>> >
>>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>>> >
>>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>> >>> Hi,
>>> >>> My application performs well without shared memory utilization, but
>>> >>> with shared memory I get worse performance than without it.
>>> >>> Am I making a mistake? Is there something I'm not paying attention to?
>>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it
>>> is
>>> >>> in the local filesystem.
>>> >>>
>>> >>
>>> >> I guess you mean shared memory message passing.   Among relevant
>>> parameters may be the message size where your implementation switches from
>>> cached copy to non-temporal (if you are on a platform where that terminology
>>> is used).  If built with Intel compilers, for example, the copy may be
>>> performed by intel_fast_memcpy, with a default setting which uses
>>> non-temporal when the message exceeds about some preset size, e.g. 50% of
>>> smallest L2 cache for that architecture.
>>> >> A quick search for past posts seems to indicate that OpenMPI doesn't
>>> itself invoke non-temporal, but there appear to be several useful articles
>>> not connected with 

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Ralph Castain
The fact that this exactly matches the time you measured with shared memory is 
suspicious. My guess is that you aren't actually using shared memory at all.

Does your "ompi_info" output show shared memory as being available? Jeff or 
others may be able to give you some params that would let you check to see if 
sm is actually being used between those procs.
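Two quick checks along those lines (a sketch; your_app stands in for Michele's binary, and the verbosity level is just an example):

ompi_info | grep "MCA btl"
mpirun -np 2 --mca btl self,sm --mca btl_base_verbose 30 your_app

The first command should list an sm component if shared memory support was built; the second makes the btl framework log which components it opens and selects, which should reveal whether sm is really carrying the on-node traffic.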



On Mar 28, 2011, at 7:51 AM, Michele Marena wrote:

> What happens with 2 processes on the same node with tcp?
> With --mca btl self,tcp my app runs in 23s.
> 
> 2011/3/28 Jeff Squyres (jsquyres) 
> Ah, I didn't catch before that there were more variables than just tcp vs. 
> shmem. 
> 
> What happens with 2 processes on the same node with tcp?
> 
> Eg, when both procs are on the same node, are you thrashing caches or memory?
> 
> Sent from my phone. No type good. 
> 
> On Mar 28, 2011, at 6:27 AM, "Michele Marena"  wrote:
> 
>> However, I thank you Tim, Ralph and Jeff.
>> My sequential application runs in 24s (wall clock time).
>> My parallel application runs in 13s with two processes on different nodes.
>> With shared memory, when two processes are on the same node, my app runs in 
>> 23s.
>> I don't understand why.
>> 
>> 2011/3/28 Jeff Squyres 
>> If your program runs faster across 3 processes, 2 of which are local to each 
>> other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
>> something is very, very strange.
>> 
>> Tim cites all kinds of things that can cause slowdowns, but it's still very, 
>> very odd that simply enabling using the shared memory communications channel 
>> in Open MPI *slows your overall application down*.
>> 
>> How much does your application slow down in wall clock time?  Seconds?  
>> Minutes?  Hours?  (anything less than 1 second is in the noise)
>> 
>> 
>> 
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>> 
>> >
>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>> >
>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> >>> Hi,
>> >>> My application performs well without shared memory utilization, but with
>> >>> shared memory I get worse performance than without it.
>> >>> Am I making a mistake? Is there something I'm not paying attention to?
>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>> >>> in the local filesystem.
>> >>>
>> >>
>> >> I guess you mean shared memory message passing.   Among relevant 
>> >> parameters may be the message size where your implementation switches 
>> >> from cached copy to non-temporal (if you are on a platform where that 
>> >> terminology is used).  If built with Intel compilers, for example, the 
>> >> copy may be performed by intel_fast_memcpy, with a default setting which 
>> >> uses non-temporal when the message exceeds about some preset size, e.g. 
>> >> 50% of smallest L2 cache for that architecture.
>> >> A quick search for past posts seems to indicate that OpenMPI doesn't 
>> >> itself invoke non-temporal, but there appear to be several useful 
>> >> articles not connected with OpenMPI.
>> >> In case guesses aren't sufficient, it's often necessary to profile 
>> >> (gprof, oprofile, Vtune, ) to pin this down.
>> >> If shared message slows your application down, the question is whether 
>> >> this is due to excessive eviction of data from cache; not a simple 
>> >> question, as most recent CPUs have 3 levels of cache, and your 
>> >> application may require more or less data which was in use prior to the 
>> >> message receipt, and may use immediately only a small piece of a large 
>> >> message.
>> >
>> > There were several papers published in earlier years about shared memory 
>> > performance in the 1.2 series.There were known problems with that 
>> > implementation, which is why it was heavily revised for the 1.3/4 series.
>> >
>> > You might also look at the following links, though much of it has been 
>> > updated to the 1.3/4 series as we don't really support 1.2 any more:
>> >
>> > http://www.open-mpi.org/faq/?category=sm
>> >
>> > http://www.open-mpi.org/faq/?category=perftools
>> >
>> >
>> >>
>> >> --
>> >> Tim Prince
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> 

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
What happens with 2 processes on the same node with tcp?
With --mca btl self,tcp my app runs in 23s.

2011/3/28 Jeff Squyres (jsquyres) 

> Ah, I didn't catch before that there were more variables than just tcp vs.
> shmem.
>
> What happens with 2 processes on the same node with tcp?
>
> Eg, when both procs are on the same node, are you thrashing caches or
> memory?
>
> Sent from my phone. No type good.
>
> On Mar 28, 2011, at 6:27 AM, "Michele Marena" 
> wrote:
>
> However, I thank you Tim, Ralph and Jeff.
> My sequential application runs in 24s (wall clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when two processes are on the same node, my app runs in
> 23s.
> I don't understand why.
>
> 2011/3/28 Jeff Squyres < jsquy...@cisco.com>
>
>> If your program runs faster across 3 processes, 2 of which are local to
>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
>> something is very, very strange.
>>
>> Tim cites all kinds of things that can cause slowdowns, but it's still
>> very, very odd that simply enabling using the shared memory communications
>> channel in Open MPI *slows your overall application down*.
>>
>> How much does your application slow down in wall clock time?  Seconds?
>>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>>
>>
>>
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>
>> >
>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>> >
>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> >>> Hi,
>> >>> My application performs well without shared memory utilization, but
>> >>> with shared memory I get worse performance than without it.
>> >>> Am I making a mistake? Is there something I'm not paying attention to?
>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>> >>> in the local filesystem.
>> >>>
>> >>
>> >> I guess you mean shared memory message passing.   Among relevant
>> parameters may be the message size where your implementation switches from
>> cached copy to non-temporal (if you are on a platform where that terminology
>> is used).  If built with Intel compilers, for example, the copy may be
>> performed by intel_fast_memcpy, with a default setting which uses
>> non-temporal when the message exceeds about some preset size, e.g. 50% of
>> smallest L2 cache for that architecture.
>> >> A quick search for past posts seems to indicate that OpenMPI doesn't
>> itself invoke non-temporal, but there appear to be several useful articles
>> not connected with OpenMPI.
>> >> In case guesses aren't sufficient, it's often necessary to profile
>> (gprof, oprofile, Vtune, ) to pin this down.
>> >> If shared message slows your application down, the question is whether
>> this is due to excessive eviction of data from cache; not a simple question,
>> as most recent CPUs have 3 levels of cache, and your application may require
>> more or less data which was in use prior to the message receipt, and may use
>> immediately only a small piece of a large message.
>> >
>> > There were several papers published in earlier years about shared memory
>> performance in the 1.2 series. There were known problems with that
>> implementation, which is why it was heavily revised for the 1.3/4 series.
>> >
>> > You might also look at the following links, though much of it has been
>> updated to the 1.3/4 series as we don't really support 1.2 any more:
>> >
>> > 
>> http://www.open-mpi.org/faq/?category=sm
>> >
>> > 
>> http://www.open-mpi.org/faq/?category=perftools
>> >
>> >

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Tim Prince

On 3/28/2011 3:29 AM, Michele Marena wrote:

Each node has two processors (no dual-core).

which seems to imply that the two processors share memory space and a
single memory bus, and the question is not about what I originally guessed.
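One quick sanity check is whether the two ranks really end up one per processor. With a 1.3/1.4-series mpirun the placement and binding can be printed directly (a sketch; ./my_app and node-1-17 are stand-ins for the real binary and host):

    mpirun -np 2 --host node-1-17 --bind-to-core --report-bindings \
           --mca btl self,sm ./my_app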


--
Tim Prince


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Jeff Squyres (jsquyres)
Ah, I didn't catch before that there were more variables than just tcp vs. 
shmem. 

What happens with 2 processes on the same node with tcp?

Eg, when both procs are on the same node, are you thrashing caches or memory?

Sent from my phone. No type good. 

On Mar 28, 2011, at 6:27 AM, "Michele Marena"  wrote:

> However, thank you Tim, Ralph, and Jeff.
> My sequential application runs in 24s (wall clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when two processes are on the same node, my app runs in 
> 23s.
> I don't understand why.
> 
> 2011/3/28 Jeff Squyres 
> If your program runs faster across 3 processes, 2 of which are local to each 
> other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
> something is very, very strange.
> 
> Tim cites all kinds of things that can cause slowdowns, but it's still very, 
> very odd that simply enabling the shared memory communications channel 
> in Open MPI *slows your overall application down*.
> 
> How much does your application slow down in wall clock time?  Seconds?  
> Minutes?  Hours?  (anything less than 1 second is in the noise)
> 
> 
> 
> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
> 
> >
> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
> >
> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
> >>> Hi,
> >>> My application performs well without shared memory, but with
> >>> shared memory I get worse performance than without it.
> >>> Am I making a mistake? Is there something I am not paying attention to?
> >>> I know Open MPI uses the /tmp directory to allocate shared memory and it is
> >>> in the local filesystem.
> >>>
> >>
> >> I guess you mean shared memory message passing.   Among relevant 
> >> parameters may be the message size where your implementation switches from 
> >> cached copy to non-temporal (if you are on a platform where that 
> >> terminology is used).  If built with Intel compilers, for example, the 
> >> copy may be performed by intel_fast_memcpy, with a default setting which 
> >> uses non-temporal when the message exceeds about some preset size, e.g. 
> >> 50% of smallest L2 cache for that architecture.
> >> A quick search for past posts seems to indicate that OpenMPI doesn't 
> >> itself invoke non-temporal, but there appear to be several useful articles 
> >> not connected with OpenMPI.
> >> In case guesses aren't sufficient, it's often necessary to profile (gprof, 
> >> oprofile, Vtune, ) to pin this down.
> >> If shared message slows your application down, the question is whether 
> >> this is due to excessive eviction of data from cache; not a simple 
> >> question, as most recent CPUs have 3 levels of cache, and your application 
> >> may require more or less data which was in use prior to the message 
> >> receipt, and may use immediately only a small piece of a large message.
> >
> > There were several papers published in earlier years about shared memory 
> > performance in the 1.2 series. There were known problems with that 
> > implementation, which is why it was heavily revised for the 1.3/4 series.
> >
> > You might also look at the following links, though much of it has been 
> > updated to the 1.3/4 series as we don't really support 1.2 any more:
> >
> > http://www.open-mpi.org/faq/?category=sm
> >
> > http://www.open-mpi.org/faq/?category=perftools
> >
> >
> >>


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
Each node has two processors (no dual-core).

2011/3/28 Michele Marena 

> However, thank you Tim, Ralph, and Jeff.
> My sequential application runs in 24s (wall clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when two processes are on the same node, my app runs in
> 23s.
> I don't understand why.
>
> 2011/3/28 Jeff Squyres 
>
>> If your program runs faster across 3 processes, 2 of which are local to
>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
>> something is very, very strange.
>>
>> Tim cites all kinds of things that can cause slowdowns, but it's still
>> very, very odd that simply enabling the shared memory communications
>> channel in Open MPI *slows your overall application down*.
>>
>> How much does your application slow down in wall clock time?  Seconds?
>>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>>
>>
>>
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>
>> >
>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>> >
>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> >>> Hi,
>> >>> My application performs well without shared memory, but with
>> >>> shared memory I get worse performance than without it.
>> >>> Am I making a mistake? Is there something I am not paying attention to?
>> >>> I know Open MPI uses the /tmp directory to allocate shared memory and it is
>> >>> in the local filesystem.
>> >>>
>> >>
>> >> I guess you mean shared memory message passing.   Among relevant
>> parameters may be the message size where your implementation switches from
>> cached copy to non-temporal (if you are on a platform where that terminology
>> is used).  If built with Intel compilers, for example, the copy may be
>> performed by intel_fast_memcpy, with a default setting which uses
>> non-temporal when the message exceeds about some preset size, e.g. 50% of
>> smallest L2 cache for that architecture.
>> >> A quick search for past posts seems to indicate that OpenMPI doesn't
>> itself invoke non-temporal, but there appear to be several useful articles
>> not connected with OpenMPI.
>> >> In case guesses aren't sufficient, it's often necessary to profile
>> (gprof, oprofile, Vtune, ) to pin this down.
>> >> If shared message slows your application down, the question is whether
>> this is due to excessive eviction of data from cache; not a simple question,
>> as most recent CPUs have 3 levels of cache, and your application may require
>> more or less data which was in use prior to the message receipt, and may use
>> immediately only a small piece of a large message.
>> >
>> > There were several papers published in earlier years about shared memory
>> performance in the 1.2 series. There were known problems with that
>> implementation, which is why it was heavily revised for the 1.3/4 series.
>> >
>> > You might also look at the following links, though much of it has been
>> updated to the 1.3/4 series as we don't really support 1.2 any more:
>> >
>> > http://www.open-mpi.org/faq/?category=sm
>> >
>> > http://www.open-mpi.org/faq/?category=perftools
>> >
>> >
>> >>
>>
>
>


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
However, thank you Tim, Ralph, and Jeff.
My sequential application runs in 24s (wall clock time).
My parallel application runs in 13s with two processes on different nodes.
With shared memory, when two processes are on the same node, my app runs in
23s.
I don't understand why.

2011/3/28 Jeff Squyres 

> If your program runs faster across 3 processes, 2 of which are local to
> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
> something is very, very strange.
>
> Tim cites all kinds of things that can cause slowdowns, but it's still
> very, very odd that simply enabling the shared memory communications
> channel in Open MPI *slows your overall application down*.
>
> How much does your application slow down in wall clock time?  Seconds?
>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>
>
>
> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>
> >
> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
> >
> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
> >>> Hi,
> >>> My application performs well without shared memory, but with
> >>> shared memory I get worse performance than without it.
> >>> Am I making a mistake? Is there something I am not paying attention to?
> >>> I know Open MPI uses the /tmp directory to allocate shared memory and it is
> >>> in the local filesystem.
> >>>
> >>
> >> I guess you mean shared memory message passing.   Among relevant
> parameters may be the message size where your implementation switches from
> cached copy to non-temporal (if you are on a platform where that terminology
> is used).  If built with Intel compilers, for example, the copy may be
> performed by intel_fast_memcpy, with a default setting which uses
> non-temporal when the message exceeds about some preset size, e.g. 50% of
> smallest L2 cache for that architecture.
> >> A quick search for past posts seems to indicate that OpenMPI doesn't
> itself invoke non-temporal, but there appear to be several useful articles
> not connected with OpenMPI.
> >> In case guesses aren't sufficient, it's often necessary to profile
> (gprof, oprofile, Vtune, ) to pin this down.
> >> If shared message slows your application down, the question is whether
> this is due to excessive eviction of data from cache; not a simple question,
> as most recent CPUs have 3 levels of cache, and your application may require
> more or less data which was in use prior to the message receipt, and may use
> immediately only a small piece of a large message.
> >
> > There were several papers published in earlier years about shared memory
> performance in the 1.2 series. There were known problems with that
> implementation, which is why it was heavily revised for the 1.3/4 series.
> >
> > You might also look at the following links, though much of it has been
> updated to the 1.3/4 series as we don't really support 1.2 any more:
> >
> > http://www.open-mpi.org/faq/?category=sm
> >
> > http://www.open-mpi.org/faq/?category=perftools
> >
> >
> >>
>


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Jeff Squyres
If your program runs faster across 3 processes, 2 of which are local to each 
other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
something is very, very strange.

Tim cites all kinds of things that can cause slowdowns, but it's still very, 
very odd that simply enabling the shared memory communications channel in 
Open MPI *slows your overall application down*.

How much does your application slow down in wall clock time?  Seconds?  
Minutes?  Hours?  (anything less than 1 second is in the noise)
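Since run-to-run noise can easily reach a second, it helps to time each configuration a few times from the shell, e.g. (a sketch; ./my_app and the hostfile name are stand-ins):

    /usr/bin/time -p mpirun -np 3 --hostfile machinefile --mca btl self,tcp    ./my_app
    /usr/bin/time -p mpirun -np 3 --hostfile machinefile --mca btl self,sm,tcp ./my_app

The "real" figure printed by time is the wall-clock number being asked about.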



On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:

> 
> On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
> 
>> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>> Hi,
>>> My application performs well without shared memory, but with
>>> shared memory I get worse performance than without it.
>>> Am I making a mistake? Is there something I am not paying attention to?
>>> I know Open MPI uses the /tmp directory to allocate shared memory and it is
>>> in the local filesystem.
>>> 
>> 
>> I guess you mean shared memory message passing.   Among relevant parameters 
>> may be the message size where your implementation switches from cached copy 
>> to non-temporal (if you are on a platform where that terminology is used).  
>> If built with Intel compilers, for example, the copy may be performed by 
>> intel_fast_memcpy, with a default setting which uses non-temporal when the 
>> message exceeds about some preset size, e.g. 50% of smallest L2 cache for 
>> that architecture.
>> A quick search for past posts seems to indicate that OpenMPI doesn't itself 
>> invoke non-temporal, but there appear to be several useful articles not 
>> connected with OpenMPI.
>> In case guesses aren't sufficient, it's often necessary to profile (gprof, 
>> oprofile, Vtune, ) to pin this down.
>> If shared message slows your application down, the question is whether this 
>> is due to excessive eviction of data from cache; not a simple question, as 
>> most recent CPUs have 3 levels of cache, and your application may require 
>> more or less data which was in use prior to the message receipt, and may use 
>> immediately only a small piece of a large message.
> 
> There were several papers published in earlier years about shared memory 
> performance in the 1.2 series. There were known problems with that 
> implementation, which is why it was heavily revised for the 1.3/4 series.
> 
> You might also look at the following links, though much of it has been 
> updated to the 1.3/4 series as we don't really support 1.2 any more:
> 
> http://www.open-mpi.org/faq/?category=sm
> 
> http://www.open-mpi.org/faq/?category=perftools
> 
> 
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Ralph Castain

On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:

> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> Hi,
>> My application performs well without shared memory, but with
>> shared memory I get worse performance than without it.
>> Am I making a mistake? Is there something I am not paying attention to?
>> I know Open MPI uses the /tmp directory to allocate shared memory and it is
>> in the local filesystem.
>> 
> 
> I guess you mean shared memory message passing.   Among relevant parameters 
> may be the message size where your implementation switches from cached copy 
> to non-temporal (if you are on a platform where that terminology is used).  
> If built with Intel compilers, for example, the copy may be performed by 
> intel_fast_memcpy, with a default setting which uses non-temporal when the 
> message exceeds about some preset size, e.g. 50% of smallest L2 cache for 
> that architecture.
> A quick search for past posts seems to indicate that OpenMPI doesn't itself 
> invoke non-temporal, but there appear to be several useful articles not 
> connected with OpenMPI.
> In case guesses aren't sufficient, it's often necessary to profile (gprof, 
> oprofile, Vtune, ) to pin this down.
> If shared message slows your application down, the question is whether this 
> is due to excessive eviction of data from cache; not a simple question, as 
> most recent CPUs have 3 levels of cache, and your application may require 
> more or less data which was in use prior to the message receipt, and may use 
> immediately only a small piece of a large message.

There were several papers published in earlier years about shared memory 
performance in the 1.2 series. There were known problems with that 
implementation, which is why it was heavily revised for the 1.3/4 series.

You might also look at the following links, though much of it has been updated 
to the 1.3/4 series as we don't really support 1.2 any more:

http://www.open-mpi.org/faq/?category=sm

http://www.open-mpi.org/faq/?category=perftools
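A quick way to confirm which Open MPI release and which sm settings are actually in use is ompi_info (a sketch; output details vary by version):

    ompi_info | head                # the first lines report the Open MPI version
    ompi_info --param btl sm        # lists the sm btl's tunable parameters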






Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Tim Prince

On 3/27/2011 2:26 AM, Michele Marena wrote:

Hi,
My application performs well without shared memory, but with
shared memory I get worse performance than without it.
Am I making a mistake? Is there something I am not paying attention to?
I know Open MPI uses the /tmp directory to allocate shared memory and it is
in the local filesystem.



I guess you mean shared memory message passing.   Among relevant 
parameters may be the message size where your implementation switches 
from cached copy to non-temporal (if you are on a platform where that 
terminology is used).  If built with Intel compilers, for example, the 
copy may be performed by intel_fast_memcpy, with a default setting which 
uses non-temporal when the message exceeds about some preset size, e.g. 
50% of smallest L2 cache for that architecture.
A quick search for past posts seems to indicate that OpenMPI doesn't 
itself invoke non-temporal, but there appear to be several useful 
articles not connected with OpenMPI.
In case guesses aren't sufficient, it's often necessary to profile 
(gprof, oprofile, Vtune, ) to pin this down.
If shared message slows your application down, the question is whether 
this is due to excessive eviction of data from cache; not a simple 
question, as most recent CPUs have 3 levels of cache, and your 
application may require more or less data which was in use prior to the 
message receipt, and may use immediately only a small piece of a large 
message.
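A minimal ping-pong loop, run on one node once with --mca btl self,sm and once with --mca btl self,tcp, can show at which message sizes the two paths diverge (a sketch for illustration, not the application in question):

    /* pingpong.c -- time round trips between ranks 0 and 1 at increasing
     * message sizes.  Build with "mpicc pingpong.c -o pingpong" and run with
     * two processes, once per btl selection being compared. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, size, i, iters = 200;
        double t0, t1;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
            MPI_Finalize();
            return 1;
        }

        for (size = 1; size <= (1 << 22); size <<= 1) {
            buf = malloc(size);
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < iters; i++) {
                if (rank == 0) {
                    /* rank 0 sends and waits for the echo: one round trip */
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    /* rank 1 echoes the message back */
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            t1 = MPI_Wtime();
            if (rank == 0)
                printf("%9d bytes  %10.2f us per round trip\n",
                       size, (t1 - t0) * 1e6 / iters);
            free(buf);
        }

        MPI_Finalize();
        return 0;
    }

If the sm numbers only fall behind tcp above some message size, that points at the copy strategy and cache effects described above rather than at the btl selection itself.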


--
Tim Prince


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Michele Marena
This is my machinefile
node-1-16 slots=2
node-1-17 slots=2
node-1-18 slots=2
node-1-19 slots=2
node-1-20 slots=2
node-1-21 slots=2
node-1-22 slots=2
node-1-23 slots=2

Each cluster node has 2 processors. I launch my application with 3
processes: one on node-1-16 (the manager) and two on node-1-17 (the workers). The
two processes on node-1-17 communicate with each other.
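If it helps to double-check where the three ranks actually land, a 1.3-series or later mpirun can print the process map before launching (a sketch; ./my_app and the hostfile name are stand-ins):

    mpirun -np 3 --hostfile machinefile --display-map --mca btl self,sm,tcp ./my_app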

2011/3/27 Michele Marena 

> Hi,
> My application performs well without shared memory, but with
> shared memory I get worse performance than without it.
> Am I making a mistake? Is there something I am not paying attention to?
> I know Open MPI uses the /tmp directory to allocate shared memory and it is in
> the local filesystem.
>
> I thank you.
> Michele.
>


[OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Michele Marena
Hi,
My application performs well without shared memory, but with
shared memory I get worse performance than without it.
Am I making a mistake? Is there something I am not paying attention to?
I know Open MPI uses the /tmp directory to allocate shared memory and it is in
the local filesystem.

I thank you.
Michele.
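On the /tmp point: the sm backing file is mmapped, so a local /tmp is normally not the bottleneck; the case to avoid is a temporary directory on a network filesystem. If it ever needs to move, the session directory base can be pointed elsewhere (a sketch; ./my_app is again a stand-in, and the exact parameter name should be checked on the installed version first):

    ompi_info --all | grep tmpdir
    mpirun --mca orte_tmpdir_base /dev/shm -np 2 ./my_app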