Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Eugene Loh

Storm Zhang wrote:



Here is what I meant: the results for 500 procs in fact show that with 
only 272-304 (< 500) real cores, the program's running time is good: it is 
almost five times the 100-proc time, so that case is handled very well. 
Therefore I guess OpenMPI or the Rocks OS does make use of hyperthreading 
to do the job. But with 600 procs, the running time is more than 
double that of 500 procs. I don't know why. This is my problem.

BTW, how do I use -bind-to-core? I added it as an mpirun option, but it 
always gives me the error "the executable 'bind-to-core' can't be found." 
Isn't it like:

mpirun --mca btl_tcp_if_include eth0 -np 600  -bind-to-core scatttest


Thanks for sending the mpirun command line and the error message.  That helps.

It's not recognizing the --bind-to-core option.  (Single hyphen, as you 
had, should also be okay.)  Skimming through the e-mail, it looks like 
you are using OMPI 1.3.2 and 1.4.2.  Did you try --bind-to-core with 
both?  If I remember my version numbers, --bind-to-core will not be 
recognized with 1.3.2, but should be with 1.4.2.  Could it be that you 
only tried 1.3.2?


Another option is to try "mpirun --help".  Make sure that it reports 
--bind-to-core.
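
For example, something along these lines (the exact output will vary; the grep 
is just a convenience):

   mpirun --version
   mpirun --help | grep bind

If the help output lists --bind-to-core, then a command such as

   mpirun --mca btl_tcp_if_include eth0 -np 600 --bind-to-core scatttest

should be accepted; if it does not, you are almost certainly picking up the 
1.3.2 mpirun.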


Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Storm Zhang
Here is what I meant: the results for 500 procs in fact show that with only
272-304 (< 500) real cores, the program's running time is good: it is
almost five times the 100-proc time, so that case is handled very well.
Therefore I guess OpenMPI or the Rocks OS does make use of hyperthreading
to do the job. But with 600 procs, the running time is more than double
that of 500 procs. I don't know why. This is my problem.

BTW, how do I use -bind-to-core? I added it as an mpirun option, but it
always gives me the error "the executable 'bind-to-core' can't be found."
Isn't it like:
mpirun --mca btl_tcp_if_include eth0 -np 600  -bind-to-core scatttest

Thank you very much.

Linbao

On Mon, Oct 4, 2010 at 4:42 PM, Ralph Castain  wrote:

>
> On Oct 4, 2010, at 1:48 PM, Storm Zhang wrote:
>
> Thanks a lot, Ralph. As I said, I also tried to use SGE (which also shows 1024
> slots available for parallel tasks), and it assigns only 34-38 compute nodes,
> i.e. only 272-304 real cores, for the 500-proc run. The running time is
> consistent with the 100-proc case, without large fluctuations as the number of
> machines changes.
>
>
> Afraid I don't understand your statement. If you have 500 procs running on
> < 500 cores, then the performance relative to a high-performance job (#procs
> <= #cores) will be worse. We deliberately dial down the performance when
> oversubscribed to ensure that procs "play nice" in situations where the node
> is oversubscribed.
>
>  So I guess it is not related to hyperthreading. Correct me if I'm wrong.
>
>
> Has nothing to do with hyperthreading - OMPI has no knowledge of
> hyperthreads at this time.
>
>
> BTW, how to bind the proc to the core? I tried --bind-to-core or
> -bind-to-core but neither works. Is it for OpenMP, not for OpenMPI?
>
>
> Those should work. You might try --report-bindings to see what OMPI thought
> it did.
>
>
> Linbao
>
>
> On Mon, Oct 4, 2010 at 12:27 PM, Ralph Castain  wrote:
>
>> Some of what you are seeing is the natural result of context
>> switching. Some thoughts regarding the results:
>>
>> 1. You didn't bind your procs to cores when running with #procs < #cores,
>> so your performance in those scenarios will also be less than max.
>>
>> 2. Once the number of procs exceeds the number of cores, you guarantee a
>> lot of context switching, so performance will definitely take a hit.
>>
>> 3. Sometime in the not-too-distant future, OMPI will (hopefully) become
>> hyperthread aware. For now, we don't see them as separate processing units.
>> So as far as OMPI is concerned, you only have 512 computing units to work
>> with, not 1024.
>>
>> Bottom line is that you are running oversubscribed, so OMPI turns down
>> your performance so that the machine doesn't hemorrhage as it context
>> switches.
>>
>>
>> On Oct 4, 2010, at 11:06 AM, Doug Reeder wrote:
>>
>> In my experience hyperthreading can't really deliver two cores worth of
>> processing simultaneously for processes expecting sole use of a core. Since
>> you really have 512 cores I'm not surprised that you see a performance hit
>> when requesting > 512 compute units. We should really get input from a
>> hyperthreading expert, preferably from Intel.
>>
>> Doug Reeder
>> On Oct 4, 2010, at 9:53 AM, Storm Zhang wrote:
>>
>> We have 64 compute nodes which are dual quad-core and hyperthreaded CPUs.
>> So we have 1024 compute units shown in the ROCKS 5.3 system. I'm trying to
>> scatter an array from the master node to the compute nodes using mpiCC and
>> mpirun with C++.
>>
>> Here is my test:
>>
>> The array size is 18KB * Number of compute nodes and is scattered to the
>> compute nodes 5000 times repeatedly.
>>
>> The average running time(seconds):
>>
>> 100 nodes: 170,
>> 400 nodes: 690,
>> 500 nodes: 855,
>> 600 nodes: 2550,
>> 700 nodes: 2720,
>> 800 nodes: 2900,
>>
>> There is a big jump in running time from 500 nodes to 600 nodes, and I
>> don't know what the problem is.
>> I tried both OMPI 1.3.2 and OMPI 1.4.2. Running time is a little faster
>> for all the tests in 1.4.2, but the jump still exists.
>> I tried using either the Bcast function or simply Send/Recv, which give very
>> close results.
>> I tried both running it directly and using SGE and got the same results.
>>
>> The code and ompi_info are attached to this email. The direct running
>> command is :
>> /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --machinefile
>> ../machines -np 600 scatttest
>>
>> The ifconfig of head node for eth0 is:
>> eth0  Link encap:Ethernet  HWaddr 00:26:B9:56:8B:44
>>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>>   inet6 addr: fe80::226:b9ff:fe56:8b44/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>   RX packets:1096060373 errors:0 dropped:2512622 overruns:0
>> frame:0
>>   TX packets:513387679 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:1000
>>   RX bytes:832328807459 (775.1 GiB)  TX bytes:250824621959 (233.5
>> GiB)
>>   In

[OMPI users] location of ompi libraries

2010-10-04 Thread David Turner

Hi,

In Open MPI 1.4.1, the directory lib/openmpi contains about 130
entries, including such things as mca_btl_openib.so.  In my
build of Open MPI 1.4.2, lib/openmpi contains exactly three
items:
libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so

I have searched my 1.4.2 installation for mca_btl_openib.so,
to no avail.  And yet, 1.4.2 seems to work "fine".  Is my
installation broken, or is the organization significantly
different between the two versions?  A quick scan of the
release notes didn't help.

Thanks!

--
Best regards,

David Turner
User Services Group    email: dptur...@lbl.gov
NERSC Division         phone: (510) 486-4027
Lawrence Berkeley Lab  fax:   (510) 486-4316


Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Ralph Castain

On Oct 4, 2010, at 1:48 PM, Storm Zhang wrote:

> Thanks a lot, Ralph. As I said, I also tried to use SGE (which also shows 1024 
> slots available for parallel tasks), and it assigns only 34-38 compute nodes, 
> i.e. only 272-304 real cores, for the 500-proc run. The running time is 
> consistent with the 100-proc case, without large fluctuations as the number of 
> machines changes.

Afraid I don't understand your statement. If you have 500 procs running on < 
500 cores, then the performance relative to a high-performance job (#procs <= 
#cores) will be worse. We deliberately dial down the performance when 
oversubscribed to ensure that procs "play nice" in situations where the node is 
oversubscribed.

>  So I guess it is not related to hyperthreading. Correct me if I'm wrong.

Has nothing to do with hyperthreading - OMPI has no knowledge of hyperthreads 
at this time.

> 
> BTW, how to bind the proc to the core? I tried --bind-to-core or 
> -bind-to-core but neither works. Is it for OpenMP, not for OpenMPI? 

Those should work. You might try --report-bindings to see what OMPI thought it 
did.
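
For example, something like

   mpirun --mca btl_tcp_if_include eth0 -np 600 -bind-to-core --report-bindings scatttest

should (with a 1.4.x mpirun) print one binding report per process as the job 
launches, so you can see whether the binding actually took effect.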

> 
> Linbao
> 
> 
> On Mon, Oct 4, 2010 at 12:27 PM, Ralph Castain  wrote:
> Some of what you are seeing is the natural result of context 
> switching. Some thoughts regarding the results:
> 
> 1. You didn't bind your procs to cores when running with #procs < #cores, so 
> your performance in those scenarios will also be less than max.
> 
> 2. Once the number of procs exceeds the number of cores, you guarantee a lot 
> of context switching, so performance will definitely take a hit.
> 
> 3. Sometime in the not-too-distant future, OMPI will (hopefully) become 
> hyperthread aware. For now, we don't see them as separate processing units. 
> So as far as OMPI is concerned, you only have 512 computing units to work 
> with, not 1024.
> 
> Bottom line is that you are running oversubscribed, so OMPI turns down your 
> performance so that the machine doesn't hemorrhage as it context switches.
> 
> 
> On Oct 4, 2010, at 11:06 AM, Doug Reeder wrote:
> 
>> In my experience hyperthreading can't really deliver two cores worth of 
>> processing simultaneously for processes expecting sole use of a core. Since 
>> you really have 512 cores I'm not surprised that you see a performance hit 
>> when requesting > 512 compute units. We should really get input from a 
>> hyperthreading expert, preferably from Intel.
>> 
>> Doug Reeder
>> On Oct 4, 2010, at 9:53 AM, Storm Zhang wrote:
>> 
>>> We have 64 compute nodes which are dual quad-core and hyperthreaded CPUs. 
>>> So we have 1024 compute units shown in the ROCKS 5.3 system. I'm trying to 
>>> scatter an array from the master node to the compute nodes using mpiCC and 
>>> mpirun with C++. 
>>> 
>>> Here is my test:
>>> 
>>> The array size is 18KB * Number of compute nodes and is scattered to the 
>>> compute nodes 5000 times repeatedly. 
>>> 
>>> The average running time(seconds):
>>> 
>>> 100 nodes: 170,
>>> 400 nodes: 690,
>>> 500 nodes: 855,
>>> 600 nodes: 2550,
>>> 700 nodes: 2720,
>>> 800 nodes: 2900,
>>> 
>>> There is a big jump in running time from 500 nodes to 600 nodes, and I don't 
>>> know what the problem is. 
>>> I tried both OMPI 1.3.2 and OMPI 1.4.2. Running time is a little faster 
>>> for all the tests in 1.4.2, but the jump still exists. 
>>> I tried using either the Bcast function or simply Send/Recv, which give very 
>>> close results. 
>>> I tried both running it directly and using SGE and got the same results.
>>> 
>>> The code and ompi_info are attached to this email. The direct running 
>>> command is :
>>> /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --machinefile 
>>> ../machines -np 600 scatttest
>>> 
>>> The ifconfig of head node for eth0 is:
>>> eth0  Link encap:Ethernet  HWaddr 00:26:B9:56:8B:44  
>>>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>>>   inet6 addr: fe80::226:b9ff:fe56:8b44/64 Scope:Link
>>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>   RX packets:1096060373 errors:0 dropped:2512622 overruns:0 frame:0
>>>   TX packets:513387679 errors:0 dropped:0 overruns:0 carrier:0
>>>   collisions:0 txqueuelen:1000 
>>>   RX bytes:832328807459 (775.1 GiB)  TX bytes:250824621959 (233.5 
>>> GiB)
>>>   Interrupt:106 Memory:d600-d6012800 
>>> 
>>> A typical ifconfig of a compute node is:
>>> eth0  Link encap:Ethernet  HWaddr 00:21:9B:9A:15:AC  
>>>   inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
>>>   inet6 addr: fe80::221:9bff:fe9a:15ac/64 Scope:Link
>>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>   RX packets:362716422 errors:0 dropped:0 overruns:0 frame:0
>>>   TX packets:349967746 errors:0 dropped:0 overruns:0 carrier:0
>>>   collisions:0 txqueuelen:1000 
>>>   RX bytes:139699954685 (130.1 GiB)  TX bytes:338207741480 (314.9 
>>> GiB)
>>>   Interrupt:82 Memory:d

Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Storm Zhang
Thanks a lot, Ralph. As I said, I also tried to use SGE (which also shows 1024
slots available for parallel tasks), and it assigns only 34-38 compute nodes,
i.e. only 272-304 real cores, for the 500-proc run. The running time is
consistent with the 100-proc case, without large fluctuations as the number of
machines changes. So I guess it is not related to hyperthreading. Correct
me if I'm wrong.

BTW, how to bind the proc to the core? I tried --bind-to-core or
-bind-to-core but neither works. Is it for OpenMP, not for OpenMPI?

Linbao


On Mon, Oct 4, 2010 at 12:27 PM, Ralph Castain  wrote:

> Some of what you are seeing is the natural result of context
> switching. Some thoughts regarding the results:
>
> 1. You didn't bind your procs to cores when running with #procs < #cores,
> so your performance in those scenarios will also be less than max.
>
> 2. Once the number of procs exceeds the number of cores, you guarantee a
> lot of context switching, so performance will definitely take a hit.
>
> 3. Sometime in the not-too-distant future, OMPI will (hopefully) become
> hyperthread aware. For now, we don't see them as separate processing units.
> So as far as OMPI is concerned, you only have 512 computing units to work
> with, not 1024.
>
> Bottom line is that you are running oversubscribed, so OMPI turns down your
> performance so that the machine doesn't hemorrhage as it context switches.
>
>
> On Oct 4, 2010, at 11:06 AM, Doug Reeder wrote:
>
> In my experience hyperthreading can't really deliver two cores worth of
> processing simultaneously for processes expecting sole use of a core. Since
> you really have 512 cores I'm not surprised that you see a performance hit
> when requesting > 512 compute units. We should really get input from a
> hyperthreading expert, preferably from Intel.
>
> Doug Reeder
> On Oct 4, 2010, at 9:53 AM, Storm Zhang wrote:
>
> We have 64 compute nodes which are dual quad-core and hyperthreaded CPUs.
> So we have 1024 compute units shown in the ROCKS 5.3 system. I'm trying to
> scatter an array from the master node to the compute nodes using mpiCC and
> mpirun with C++.
>
> Here is my test:
>
> The array size is 18KB * Number of compute nodes and is scattered to the
> compute nodes 5000 times repeatedly.
>
> The average running time(seconds):
>
> 100 nodes: 170,
> 400 nodes: 690,
> 500 nodes: 855,
> 600 nodes: 2550,
> 700 nodes: 2720,
> 800 nodes: 2900,
>
> There is a big jump in running time from 500 nodes to 600 nodes, and I don't
> know what the problem is.
> I tried both OMPI 1.3.2 and OMPI 1.4.2. Running time is a little faster
> for all the tests in 1.4.2, but the jump still exists.
> I tried using either the Bcast function or simply Send/Recv, which give very
> close results.
> I tried both running it directly and using SGE and got the same results.
>
> The code and ompi_info are attached to this email. The direct running
> command is :
> /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --machinefile
> ../machines -np 600 scatttest
>
> The ifconfig of head node for eth0 is:
> eth0  Link encap:Ethernet  HWaddr 00:26:B9:56:8B:44
>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::226:b9ff:fe56:8b44/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:1096060373 errors:0 dropped:2512622 overruns:0 frame:0
>   TX packets:513387679 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:832328807459 (775.1 GiB)  TX bytes:250824621959 (233.5
> GiB)
>   Interrupt:106 Memory:d600-d6012800
>
> A typical ifconfig of a compute node is:
> eth0  Link encap:Ethernet  HWaddr 00:21:9B:9A:15:AC
>   inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::221:9bff:fe9a:15ac/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:362716422 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:349967746 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:139699954685 (130.1 GiB)  TX bytes:338207741480 (314.9
> GiB)
>   Interrupt:82 Memory:d600-d6012800
>
>
> Can anyone help me out of this? It bothers me a lot.
>
> Thank you very much.
>
> Linbao
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Milan Hodoscek
> "Ralph" == Ralph Castain  writes:

Ralph> On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:

>>> "Ralph" == Ralph Castain  writes:
>> 
Ralph> I'm not sure why the group communicator would make a
Ralph> difference - the code area in question knows nothing about
Ralph> the mpi aspects of the job. It looks like you are hitting a
Ralph> race condition that causes a particular internal recv to
Ralph> not exist when we subsequently try to cancel it, which
Ralph> generates that error message.  How did you configure OMPI?
>> 
>> Thank you for the reply!
>> 
>> Must be some race problem, but I have no control of it, or do
>> I?

Ralph> Not really. What I don't understand is why your code would
Ralph> work fine when using comm_world, but encounter a race
Ralph> condition when using comm groups. There shouldn't be any
Ralph> timing difference between the two cases.

Fixing a race condition is sometimes easy by putting some variables into
arrays. I just did that for one of them, but it didn't help. I'll do
some more testing in this direction, but I am running out of ideas.
When you set ngrp=1 and uncomment the other mpi_comm_spawn line in the
program, you basically get only one spawn, so there is no opportunity for
a race condition. But in my real project I usually work with many spawn
calls, all using mpi_comm_world but running different
programs, etc., and that always works. This time I want to localize the
mpi_comm_spawns by a trick similar to the one in the program I sent, so
this small test case is a good model of what I would like to have.
I studied the MPI-2 standard and I think I got it right, but one never
knows...

Ralph> I'll have to take a look and see if I can spot something in
Ralph> the code...

Thanks a lot -- Milan


Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Ralph Castain
Some of what you are seeing is the natural result of context switching. Some 
thoughts regarding the results:

1. You didn't bind your procs to cores when running with #procs < #cores, so 
your performance in those scenarios will also be less than max. 

2. Once the number of procs exceeds the number of cores, you guarantee a lot of 
context switching, so performance will definitely take a hit.

3. Sometime in the not-too-distant future, OMPI will (hopefully) become 
hyperthread aware. For now, we don't see them as separate processing units. So 
as far as OMPI is concerned, you only have 512 computing units to work with, 
not 1024.

Bottom line is that you are running oversubscribed, so OMPI turns down your 
performance so that the machine doesn't hemorrhage as it context switches.
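
For what it's worth, the knob behind this behavior is the mpi_yield_when_idle
MCA parameter, which OMPI enables when it detects oversubscription so that idle
procs yield the CPU. You can force the aggressive mode with something like

   mpirun --mca mpi_yield_when_idle 0 -np 600 scatttest

but with 600 procs on 512 cores that will generally make the context-switch
thrashing worse, not better.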


On Oct 4, 2010, at 11:06 AM, Doug Reeder wrote:

> In my experience hyperthreading can't really deliver two cores worth of 
> processing simultaneously for processes expecting sole use of a core. Since 
> you really have 512 cores I'm not surprised that you see a performance hit 
> when requesting > 512 compute units. We should really get input from a 
> hyperthreading expert, preferably from Intel.
> 
> Doug Reeder
> On Oct 4, 2010, at 9:53 AM, Storm Zhang wrote:
> 
>> We have 64 compute nodes which are dual quad-core and hyperthreaded CPUs. So 
>> we have 1024 compute units shown in the ROCKS 5.3 system. I'm trying to 
>> scatter an array from the master node to the compute nodes using mpiCC and 
>> mpirun with C++. 
>> 
>> Here is my test:
>> 
>> The array size is 18KB * Number of compute nodes and is scattered to the 
>> compute nodes 5000 times repeatedly. 
>> 
>> The average running time(seconds):
>> 
>> 100 nodes: 170,
>> 400 nodes: 690,
>> 500 nodes: 855,
>> 600 nodes: 2550,
>> 700 nodes: 2720,
>> 800 nodes: 2900,
>> 
>> There is a big jump in running time from 500 nodes to 600 nodes, and I don't 
>> know what the problem is. 
>> I tried both OMPI 1.3.2 and OMPI 1.4.2. Running time is a little faster for 
>> all the tests in 1.4.2, but the jump still exists. 
>> I tried using either the Bcast function or simply Send/Recv, which give very 
>> close results. 
>> I tried both running it directly and using SGE and got the same results.
>> 
>> The code and ompi_info are attached to this email. The direct running 
>> command is :
>> /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --machinefile 
>> ../machines -np 600 scatttest
>> 
>> The ifconfig of head node for eth0 is:
>> eth0  Link encap:Ethernet  HWaddr 00:26:B9:56:8B:44  
>>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>>   inet6 addr: fe80::226:b9ff:fe56:8b44/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>   RX packets:1096060373 errors:0 dropped:2512622 overruns:0 frame:0
>>   TX packets:513387679 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:1000 
>>   RX bytes:832328807459 (775.1 GiB)  TX bytes:250824621959 (233.5 
>> GiB)
>>   Interrupt:106 Memory:d600-d6012800 
>> 
>> A typical ifconfig of a compute node is:
>> eth0  Link encap:Ethernet  HWaddr 00:21:9B:9A:15:AC  
>>   inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
>>   inet6 addr: fe80::221:9bff:fe9a:15ac/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>   RX packets:362716422 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:349967746 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:1000 
>>   RX bytes:139699954685 (130.1 GiB)  TX bytes:338207741480 (314.9 
>> GiB)
>>   Interrupt:82 Memory:d600-d6012800 
>> 
>> 
>> Can anyone help me out of this? It bothers me a lot.
>> 
>> Thank you very much.
>> 
>> Linbao
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Storm Zhang
Thanks a lot for your reply, Doug.

There is one more thing I forgot to mention. For the 500-proc test, I observe
that if I use SGE, the job runs on only about half of our cluster, say 35-38
nodes, not uniformly distributed across the whole cluster, but the running time
is still good. So I guess it is not a hyperthreading problem.

Linbao

On Mon, Oct 4, 2010 at 12:06 PM, Doug Reeder  wrote:

> In my experience hyperthreading can't really deliver two cores worth of
> processing simultaneously for processes expecting sole use of a core. Since
> you really have 512 cores I'm not surprised that you see a performance hit
> when requesting > 512 compute units. We should really get input from a
> hyperthreading expert, preferably from Intel.
>
> Doug Reeder
> On Oct 4, 2010, at 9:53 AM, Storm Zhang wrote:
>
> We have 64 compute nodes which are dual quad-core and hyperthreaded CPUs.
> So we have 1024 compute units shown in the ROCKS 5.3 system. I'm trying to
> scatter an array from the master node to the compute nodes using mpiCC and
> mpirun with C++.
>
> Here is my test:
>
> The array size is 18KB * Number of compute nodes and is scattered to the
> compute nodes 5000 times repeatedly.
>
> The average running time(seconds):
>
> 100 nodes: 170,
> 400 nodes: 690,
> 500 nodes: 855,
> 600 nodes: 2550,
> 700 nodes: 2720,
> 800 nodes: 2900,
>
> There is a big jump in running time from 500 nodes to 600 nodes, and I don't
> know what the problem is.
> I tried both OMPI 1.3.2 and OMPI 1.4.2. Running time is a little faster
> for all the tests in 1.4.2, but the jump still exists.
> I tried using either the Bcast function or simply Send/Recv, which give very
> close results.
> I tried both running it directly and using SGE and got the same results.
>
> The code and ompi_info are attached to this email. The direct running
> command is :
> /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --machinefile
> ../machines -np 600 scatttest
>
> The ifconfig of head node for eth0 is:
> eth0  Link encap:Ethernet  HWaddr 00:26:B9:56:8B:44
>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::226:b9ff:fe56:8b44/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:1096060373 errors:0 dropped:2512622 overruns:0 frame:0
>   TX packets:513387679 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:832328807459 (775.1 GiB)  TX bytes:250824621959 (233.5
> GiB)
>   Interrupt:106 Memory:d600-d6012800
>
> A typical ifconfig of a compute node is:
> eth0  Link encap:Ethernet  HWaddr 00:21:9B:9A:15:AC
>   inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::221:9bff:fe9a:15ac/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:362716422 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:349967746 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:139699954685 (130.1 GiB)  TX bytes:338207741480 (314.9
> GiB)
>   Interrupt:82 Memory:d600-d6012800
>
>
> Can anyone help me out of this? It bothers me a lot.
>
> Thank you very much.
>
> Linbao
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Doug Reeder
In my experience hyperthreading can't really deliver two cores worth of 
processing simultaneously for processes expecting sole use of a core. Since you 
really have 512 cores I'm not surprised that you see a performance hit when 
requesting > 512 compute units. We should really get input from a 
hyperthreading expert, preferably from Intel.

Doug Reeder
On Oct 4, 2010, at 9:53 AM, Storm Zhang wrote:

> We have 64 compute nodes which are dual quad-core and hyperthreaded CPUs. So 
> we have 1024 compute units shown in the ROCKS 5.3 system. I'm trying to 
> scatter an array from the master node to the compute nodes using mpiCC and 
> mpirun with C++. 
> 
> Here is my test:
> 
> The array size is 18KB * Number of compute nodes and is scattered to the 
> compute nodes 5000 times repeatedly. 
> 
> The average running time(seconds):
> 
> 100 nodes: 170,
> 400 nodes: 690,
> 500 nodes: 855,
> 600 nodes: 2550,
> 700 nodes: 2720,
> 800 nodes: 2900,
> 
> There is a big jump in running time from 500 nodes to 600 nodes, and I don't 
> know what the problem is. 
> I tried both OMPI 1.3.2 and OMPI 1.4.2. Running time is a little faster for 
> all the tests in 1.4.2, but the jump still exists. 
> I tried using either the Bcast function or simply Send/Recv, which give very 
> close results. 
> I tried both running it directly and using SGE and got the same results.
> 
> The code and ompi_info are attached to this email. The direct running command 
> is :
> /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --machinefile 
> ../machines -np 600 scatttest
> 
> The ifconfig of head node for eth0 is:
> eth0  Link encap:Ethernet  HWaddr 00:26:B9:56:8B:44  
>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::226:b9ff:fe56:8b44/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:1096060373 errors:0 dropped:2512622 overruns:0 frame:0
>   TX packets:513387679 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:832328807459 (775.1 GiB)  TX bytes:250824621959 (233.5 GiB)
>   Interrupt:106 Memory:d600-d6012800 
> 
> A typical ifconfig of a compute node is:
> eth0  Link encap:Ethernet  HWaddr 00:21:9B:9A:15:AC  
>   inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::221:9bff:fe9a:15ac/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:362716422 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:349967746 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:139699954685 (130.1 GiB)  TX bytes:338207741480 (314.9 GiB)
>   Interrupt:82 Memory:d600-d6012800 
> 
> 
> Can anyone help me out of this? It bothers me a lot.
> 
> Thank you very much.
> 
> Linbao
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Ralph Castain

On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:

>> "Ralph" == Ralph Castain  writes:
> 
>Ralph> I'm not sure why the group communicator would make a
>Ralph> difference - the code area in question knows nothing about
>Ralph> the mpi aspects of the job. It looks like you are hitting a
>Ralph> race condition that causes a particular internal recv to
>Ralph> not exist when we subsequently try to cancel it, which
>Ralph> generates that error message.  How did you configure OMPI?
> 
> Thank you for the reply!
> 
> Must be some race problem, but I have no control of it, or do I?

Not really. What I don't understand is why your code would work fine when using 
comm_world, but encounter a race condition when using comm groups. There 
shouldn't be any timing difference between the two cases.

> 
> These are the configure options that gentoo compiles openmpi-1.4.2 with:
> 
> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu 
> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info 
> --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib 
> --libdir=/usr/lib64 --sysconfdir=/etc/openmpi --without-xgrid 
> --enable-pretty-print-stacktrace --enable-orterun-prefix-by-default 
> --without-slurm --enable-contrib-no-build=vt --enable-mpi-cxx 
> --disable-io-romio --disable-heterogeneous --without-tm --enable-ipv6
> 

This looks okay.

I'll have to take a look and see if I can spot something in the code...




[OMPI users] Bad performance when scattering big size of data?

2010-10-04 Thread Storm Zhang
We have 64 compute nodes which are dual quad-core and hyperthreaded CPUs. So
we have 1024 compute units shown in the ROCKS 5.3 system. I'm trying to
scatter an array from the master node to the compute nodes using mpiCC and
mpirun with C++.

Here is my test:

The array size is 18KB * Number of compute nodes and is scattered to the
compute nodes 5000 times repeatedly.

The average running time(seconds):

100 nodes: 170,
400 nodes: 690,
500 nodes: 855,
600 nodes: 2550,
700 nodes: 2720,
800 nodes: 2900,

There is a big jump in running time from 500 nodes to 600 nodes, and I don't
know what the problem is.
I tried both OMPI 1.3.2 and OMPI 1.4.2. Running time is a little faster for
all the tests in 1.4.2, but the jump still exists.
I tried using either the Bcast function or simply Send/Recv, which give very
close results.
I tried both running it directly and using SGE and got the same results.

The code and ompi_info are attached to this email. The direct running
command is :
/opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --machinefile
../machines -np 600 scatttest
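
For readers without the attachment, here is a minimal sketch of the kind of
test being described (this is not the attached scatttest.cpp; only the 18 KB
chunk size and the 5000 repetitions come from the description above, the rest
is assumed):

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 18 * 1024;   /* 18 KB delivered to each process      */
    const int iters = 5000;        /* number of repeated scatters          */

    char *sendbuf = NULL;
    if (rank == 0)                 /* root holds 18 KB * nprocs            */
        sendbuf = new char[(size_t)chunk * nprocs];
    char *recvbuf = new char[chunk];

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i)
        MPI_Scatter(sendbuf, chunk, MPI_CHAR,
                    recvbuf, chunk, MPI_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        std::printf("%d procs: %.1f seconds\n", nprocs, t1 - t0);

    delete [] sendbuf;
    delete [] recvbuf;
    MPI_Finalize();
    return 0;
}

Compiled with mpiCC and launched as above, something like this reproduces the
shape of the measurement, even if the attached code differs in detail.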

The ifconfig of head node for eth0 is:
eth0  Link encap:Ethernet  HWaddr 00:26:B9:56:8B:44
  inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: fe80::226:b9ff:fe56:8b44/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:1096060373 errors:0 dropped:2512622 overruns:0 frame:0
  TX packets:513387679 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:832328807459 (775.1 GiB)  TX bytes:250824621959 (233.5
GiB)
  Interrupt:106 Memory:d600-d6012800

A typical ifconfig of a compute node is:
eth0  Link encap:Ethernet  HWaddr 00:21:9B:9A:15:AC
  inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: fe80::221:9bff:fe9a:15ac/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:362716422 errors:0 dropped:0 overruns:0 frame:0
  TX packets:349967746 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:139699954685 (130.1 GiB)  TX bytes:338207741480 (314.9
GiB)
  Interrupt:82 Memory:d600-d6012800


Can anyone help me out of this? It bothers me a lot.

Thank you very much.

Linbao


scatttest.cpp
Description: Binary data


ompi_info
Description: Binary data


Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Milan Hodoscek
> "Ralph" == Ralph Castain  writes:

Ralph> I'm not sure why the group communicator would make a
Ralph> difference - the code area in question knows nothing about
Ralph> the mpi aspects of the job. It looks like you are hitting a
Ralph> race condition that causes a particular internal recv to
Ralph> not exist when we subsequently try to cancel it, which
Ralph> generates that error message.  How did you configure OMPI?

Thank you for the reply!

Must be some race problem, but I have no control of it, or do I?

These are the configure options that gentoo compiles openmpi-1.4.2 with:

./configure --prefix=/usr --build=x86_64-pc-linux-gnu 
--host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info 
--datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib 
--libdir=/usr/lib64 --sysconfdir=/etc/openmpi --without-xgrid 
--enable-pretty-print-stacktrace --enable-orterun-prefix-by-default 
--without-slurm --enable-contrib-no-build=vt --enable-mpi-cxx 
--disable-io-romio --disable-heterogeneous --without-tm --enable-ipv6



Re: [OMPI users] new open mpi user questions

2010-10-04 Thread Nico Mittenzwey

Ed Peddycoart wrote:
The machines are RHEL 5.4 machines and an older version of Open MPI was 
already installed.  I understand from the FAQ that I should not simply 
overwrite the old installation, but would it be better to remove the 
old version or to install the new one to a different location?  I have no need 
for the old version.


Remove it if you don't need it.
If you compile from source, you can add the argument 
--prefix=/home/username/openmpi (or similar) to the ./configure command 
to install Open MPI in that location. Just set your PATH and 
LD_LIBRARY_PATH variables accordingly to compile and run your applications.
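
For example, with the --prefix above and a bash shell:

   export PATH=/home/username/openmpi/bin:$PATH
   export LD_LIBRARY_PATH=/home/username/openmpi/lib:$LD_LIBRARY_PATH
   which mpirun    # should now report the new installation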



Are there test apps with the Open MPI distribution that I can use to 
produce some benchmarks?  One thing I want is to get an idea of data 
transfer rates as a function of the size of the data transferred.


A nice bandwidth benchmark is netpipe from 
http://www.scl.ameslab.gov/netpipe/. Compile with "make mpi".
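
In the NetPIPE versions I have used, the MPI target builds a binary called
NPmpi, so a typical two-process run would look something like

   mpirun -np 2 --machinefile hosts NPmpi

(hosts being a placeholder machine file). It sweeps the message size and
reports throughput and latency for each size, which is exactly the
transfer-rate-versus-size curve you are after.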


Nico



Re: [OMPI users] Shared memory

2010-10-04 Thread Andrei Fokau
Does OMPI have shared memory capabilities (as it is mentioned in MPI-2)?
How can I use them?

Andrei


On Sat, Sep 25, 2010 at 23:19, Andrei Fokau wrote:

> Here are some more details about our problem. We use a dozen 4-processor
> nodes with 8 GB of memory on each node. The code we run needs about 3 GB per
> processor, so we can load only 2 processors out of 4. The vast majority of
> those 3 GB is the same for each processor and is accessed continuously
> during calculation. In my original question I wasn't very clear when asking
> about the possibility of using shared memory with Open MPI - in our case we do
> not need remote access to the data, and it would be sufficient to
> share memory within each node only.
>
> Of course, the possibility to access the data remotely (via mmap) is
> attractive because it would allow to store much larger arrays (up to 10 GB)
> at one remote place, meaning higher accuracy for our calculations. However,
> I believe that the access time would be too long for the data read so
> frequently, and therefore the performance would be lost.
>
> I still hope that some of the subscribers to this mailing list have
> experience using Global Arrays. This library seems to be fine for our
> case; however, I feel that there should be a simpler solution. Open MPI
> conforms to the MPI-2 standard, and the latter describes shared-memory
> applications. Do you see any other way for us to use shared memory
> (within a node) apart from using Global Arrays?
>
> Andrei
>
>
> On Fri, Sep 24, 2010 at 19:03, Durga Choudhury  wrote:
>
>> I think the 'middle ground' approach can be simplified even further if
>> the data file is in a shared device (e.g. NFS/Samba mount) that can be
>> mounted at the same location of the file system tree on all nodes. I
>> have never tried it, though, and mmap()'ing a non-POSIX compliant file
>> system such as Samba might have issues I am unaware of.
>>
>> However, I do not see why you should not be able to do this even if
>> the file is being written to as long as you call msync() before using
>> the mapped pages.
>>
>> Durga
>>
>>
>> On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh 
>> wrote:
>> > It seems to me there are two extremes.
>> >
>> > One is that you replicate the data for each process.  This has the
>> > disadvantage of consuming lots of memory "unnecessarily."
>> >
>> > Another extreme is that shared data is distributed over all processes.
>> This
>> > has the disadvantage of making at least some of the data less
>> accessible,
>> > whether in programming complexity and/or run-time performance.
>> >
>> > I'm not familiar with Global Arrays.  I was somewhat familiar with HPF.
>> I
>> > think the natural thing to do with those programming models is to
>> distribute
>> > data over all processes, which may relieve the excessive memory
>> consumption
>> > you're trying to address but which may also just put you at a different
>> > "extreme" of this spectrum.
>> >
>> > The middle ground I think might make most sense would be to share data
>> only
>> > within a node, but to replicate the data for each node.  There are
>> probably
>> > multiple ways of doing this -- possibly even GA, I don't know.  One way
>> > might be to use one MPI process per node, with OMP multithreading within
>> > each process|node.  Or (and I thought this was the solution you were
>> looking
>> > for), have some idea which processes are collocal.  Have one process per
>> > node create and initialize some shared memory -- mmap, perhaps, or SysV
>> > shared memory.  Then, have its peers map the same shared memory into
>> their
>> > address spaces.
>> >
>> > You asked what source code changes would be required.  It depends.  If
>> > you're going to mmap shared memory in on each node, you need to know
>> which
>> > processes are collocal.  If you're willing to constrain how processes
>> are
>> > mapped to nodes, this could be easy.  (E.g., "every 4 processes are
>> > collocal".)  If you want to discover dynamically at run time which are
>> > collocal, it would be harder.  The mmap stuff could be in a stand-alone
>> > function of about a dozen lines.  If the shared area is allocated as one
>> > piece, substituting the single malloc() call with a call to your mmap
>> > function should be simple.  If you have many malloc()s you're trying to
>> > replace, it's harder.
>> >
>> > Andrei Fokau wrote:
>> >
>> > The data are read from a file and processed before calculations begin,
>> so I
>> > think that mapping will not work in our case.
>> > Global Arrays look promising indeed. As I said, we need to put just a
>> part
>> > of data to the shared section. John, do you (or may be other users) have
>> an
>> > experience of working with GA?
>> > http://www.emsl.pnl.gov/docs/global/um/build.html
>> > When GA runs with MPI:
>> > MPI_Init(..)  ! start MPI
>> > GA_Initialize()   ! start global arrays
>> > MA_Init(..)   ! start memory allocator
>> > do work
>> > GA_Terminate()! tidy 
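
A minimal sketch of the per-node shared-memory approach Eugene describes above
(one process per node creates a segment, its peers map it), assuming POSIX
shm_open()/mmap(), assuming that every PROCS_PER_NODE consecutive ranks are
collocal, and using made-up names (PROCS_PER_NODE, /node_shared); error
checking is omitted:

#include <mpi.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int PROCS_PER_NODE = 4;            /* assumed fixed process layout    */
    const size_t BYTES = (size_t)3 << 30;    /* ~3 GB shared per node           */
    const char *name = "/node_shared";       /* node-local segment name         */
    int node_rank = rank % PROCS_PER_NODE;   /* 0 = creator on each node        */

    int fd;
    if (node_rank == 0) {                    /* one process per node creates it */
        fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, BYTES);
    }
    MPI_Barrier(MPI_COMM_WORLD);             /* segment exists before peers open */
    if (node_rank != 0)
        fd = shm_open(name, O_RDWR, 0600);

    double *data = (double *) mmap(NULL, BYTES, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);

    if (node_rank == 0) {
        /* read the input file and fill 'data' here ...                         */
    }
    MPI_Barrier(MPI_COMM_WORLD);             /* data is ready before peers read  */

    /* ... every rank on the node can now read 'data' directly ...              */

    MPI_Barrier(MPI_COMM_WORLD);
    munmap(data, BYTES);
    close(fd);
    if (node_rank == 0) shm_unlink(name);
    MPI_Finalize();
    return 0;
}

(Link with -lrt on older Linux systems for shm_open.) This keeps one ~3 GB copy
per node instead of one per process, which is the saving being asked about.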

[OMPI users] new open mpi user questions

2010-10-04 Thread Ed Peddycoart
I would like to give Open MPI a test drive on some machines I have in my lab 
and I have a few questions...
 
The machines are RHEL 5.4 machines and an older version of Open MPI was already 
installed.  I understand from the FAQ that I should not simply overwrite the 
old installation, but would it be better to remove the old version or to install 
the new one to a different location?  I have no need for the old version.
 
Are there test apps with the Open MPI distribution that I can use to produce 
some benchmarks?  One thing I want is to get an idea of data transfer rates as a 
function of the size of the data transferred.
 
Thanks,
Ed
 
 


Re: [OMPI users] Granular locks?

2010-10-04 Thread Barrett, Brian W
On Oct 2, 2010, at 2:54 AM, Gijsbert Wiesenekker wrote:

> On Oct 1, 2010, at 23:24 , Gijsbert Wiesenekker wrote:
> 
>> I have a large array that is shared between two processes. One process 
>> updates array elements randomly, the other process reads array elements 
>> randomly. Most of the time these writes and reads do not overlap.
>> The current version of the code uses Linux shared memory with NSEMS 
>> semaphores. When array element i has to be read or updated semaphore (i % 
>> NSEMS) is used. If NSEMS = 1, the entire array will be locked, which leads to 
>> unnecessary waits because reads and writes do not overlap most of the time. 
>> Performance increases as NSEMS increases, and flattens out at NSEMS = 32, at 
>> which point the code runs twice as fast when compared to NSEMS = 1.
>> I want to change the code to use OpenMPI RMA, but MPI_Win_lock locks the 
>> entire array, which is similar to NSEMS = 1. Is there a way to have more 
>> granular locks?
>> 
>> Gijsbert
>> 
> 
> Also, is there an MPI_Win_lock equivalent for IPC_NOWAIT?


No.  Every call to MPI_Win_lock will (eventually) result in a locking of the 
window.  Note, however, that MPI_WIN_LOCK returning does not guarantee the 
remote window has been locked.  It only guarantees that it is now safe to call 
data transfer operations targeting that window.  An implementation could (and 
Open MPI frequently does) return immediately, queue up all data transfers until 
some ACK is received from the target, and then begin data movement operations.  
Confusing, but flexible for the wide variety of platforms MPI must target.

Brian

-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories





Re: [OMPI users] Granular locks?

2010-10-04 Thread Barrett, Brian W
On Oct 1, 2010, at 3:24 PM, Gijsbert Wiesenekker wrote:

> I have a large array that is shared between two processes. One process 
> updates array elements randomly, the other process reads array elements 
> randomly. Most of the time these writes and reads do not overlap.
> The current version of the code uses Linux shared memory with NSEMS 
> semaphores. When array element i has to be read or updated semaphore (i % 
> NSEMS) is used. If NSEMS = 1, the entire array will be locked, which leads to 
> unnecessary waits because reads and writes do not overlap most of the time. 
> Performance increases as NSEMS increases, and flattens out at NSEMS = 32, at 
> which point the code runs twice as fast when compared to NSEMS = 1.
> I want to change the code to use OpenMPI RMA, but MPI_Win_lock locks the 
> entire array, which is similar to NSEMS = 1. Is there a way to have more 
> granular locks?

The MPI standard defines MPI_WIN_LOCK as protecting the entire window, so the 
short answer to your question is no.  Depending on your application, it may be 
possible to have multiple windows over independent pieces of the data to get the 
behavior you want, but that does seem icky.
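
For instance, here is a rough sketch of that multiple-window idea, mirroring the
semaphore scheme above (element i guarded by window i % NWIN); the array size,
process roles and names are only illustrative:

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int NWIN    = 32;                /* plays the role of NSEMS = 32       */
    const int NELEM   = 1 << 20;           /* total number of elements (made up) */
    const int PER_WIN = NELEM / NWIN;      /* interleaved: element i lives in
                                              window i % NWIN at offset i / NWIN */

    double *slice[NWIN];
    MPI_Win  win[NWIN];
    for (int w = 0; w < NWIN; ++w) {
        slice[w] = NULL;
        if (rank == 0)                     /* rank 0 owns and exposes the array  */
            MPI_Alloc_mem(sizeof(double) * PER_WIN, MPI_INFO_NULL, &slice[w]);
        MPI_Win_create(slice[w],
                       rank == 0 ? (MPI_Aint)(sizeof(double) * PER_WIN) : 0,
                       sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win[w]);
    }

    if (rank == 1) {                       /* the "writer" process               */
        int    i = 12345;                  /* element to update (illustrative)   */
        double v = 42.0;
        int    w = i % NWIN;               /* only this window gets locked       */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win[w]);
        MPI_Put(&v, 1, MPI_DOUBLE, 0, i / NWIN, 1, MPI_DOUBLE, win[w]);
        MPI_Win_unlock(0, win[w]);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    for (int w = 0; w < NWIN; ++w) {
        MPI_Win_free(&win[w]);
        if (rank == 0) MPI_Free_mem(slice[w]);
    }
    MPI_Finalize();
    return 0;
}

The reader would do the same with MPI_LOCK_SHARED and MPI_Get. Whether this
actually helps depends on how much of MPI_Win_lock's cost is per window in the
implementation, so it is worth measuring before committing to it.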

Brian

-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories





Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Ralph Castain
I'm not sure why the group communicator would make a difference - the code area 
in question knows nothing about the mpi aspects of the job. It looks like you 
are hitting a race condition that causes a particular internal recv to not 
exist when we subsequently try to cancel it, which generates that error message.

How did you configure OMPI?


On Oct 3, 2010, at 6:40 PM, Milan Hodoscek wrote:

> Hi,
> 
> I am a long time happy user of mpi_comm_spawn() routine. But so far I
> used it only with the MPI_COMM_WORLD communicator. Now I want to
> execute more mpi_comm_spawn() routines, by creating and using group
> communicators. However, this seems to have some problems. I can get it
> to run about 50% of the time on my laptop, but on some more "speedy"
> machines it just produces the following message:
> 
> $ mpirun -n 4 a.out
> [ala:31406] [[45304,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 758
> --
> mpirun was unable to start the specified application as it encountered an 
> error.
> More information may be available above.
> --
> 
> I am attaching the 2 programs needed to test the behavior. Compile:
> $ mpif90 -o sps sps.f08 # spawned program
> $ mpif90 mspbug.f08 # program with problems
> $ mpirun -n 4 a.out
> 
> The compiler is gfortran-4.4.4, and openmpi is 1.4.2.
> 
> Needless to say it runs with mpich2, but mpich2 doesn't know how to
> deal with stdin on a spawned process, so it's useless for my project :-(
> 
> Any ideas?
> 
> -
> program sps
>  use mpi
>  implicit none
>  integer :: ier,nproc,me,pcomm,meroot,mi,on
>  integer, dimension(1:10) :: num
> 
>  call mpi_init(ier)
> 
>  mi=mpi_integer
>  call mpi_comm_rank(mpi_comm_world,me,ier)
>  meroot=0
> 
>  on=1
> 
>  call mpi_comm_get_parent(pcomm,ier)
> 
>  call mpi_bcast(num,on,mi,meroot,pcomm,ier)
>  write(*,*)'sps>me,num=',me,num(on)
> 
>  call mpi_finalize(ier)
> 
> end program sps
> -
> 
> program groupspawn
> 
>  use mpi
> 
>  implicit none
>  ! in the case use mpi does not work (eg Ubuntu) use the include below
>  ! include 'mpif.h'
>  integer :: ier,intercom,nproc,meroot,info,mpierrs(1),mcw
>  integer :: i,myrepsiz,me,np,mcg,repdgrp,repdcom,on,mi,op
>  integer, dimension(1:10) :: myrepgrp
>  character(len=5) :: sarg(1),prog
>  integer, dimension(1:10) :: num,sm
>  integer :: newme,ngrp,igrp
> 
>  call mpi_init(ier)
> 
>  prog='sps'
>  sarg(1) = ''
>  nproc=2
>  on=1
>  meroot=0
>  mcw=mpi_comm_world
>  info=mpi_info_null
>  mi=mpi_integer
>  op=mpi_sum
>  mpierrs(1)=mpi_errcodes_ignore(1)
> 
>  call mpi_comm_rank(mcw,me,ier)
>  call mpi_comm_size(mcw,np,ier)
> 
>  ngrp=2  ! lets have some groups
>  myrepsiz=np/ngrp
>  igrp=me/myrepsiz
>  do i = 1, myrepsiz
>myrepgrp(i)=i+me-mod(me,myrepsiz)-1
>  enddo
> 
>  call mpi_comm_group(mcw,mcg,ier)
>  call mpi_group_incl(mcg,myrepsiz,myrepgrp,repdgrp,ier)
>  call mpi_comm_create(mcw,repdgrp,repdcom,ier)
> 
> !  call mpi_comm_spawn(prog,sarg,nproc,info,meroot,mcw,intercom,mpierrs,ier)
>  call mpi_comm_spawn(prog,sarg,nproc,info,meroot,repdcom,intercom,mpierrs,ier)
> 
>  ! send a number to spawned ones...
> 
>  call mpi_comm_rank(intercom,newme,ier)
>  write(*,*)'me,intercom,newme=',me,intercom,newme
>  num(1)=111*(igrp+1)
> 
>  meroot=mpi_proc_null
>  if(newme == 0) meroot=mpi_root ! to send data
> 
>  call mpi_bcast(num,on,mi,meroot,intercom,ier)
>  ! sometimes there is no output from sps programs, so we wait here: WEIRD :-(
>  !call sleep(1)
> 
>  call mpi_finalize(ier)
> 
> end program groupspawn
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-04 Thread Ralph Castain
It looks to me like your remote nodes aren't finding the orted executable. I 
suspect the problem is that you need to forward the PATH and LD_LIBRARY_PATH 
to the remote nodes. Use the mpirun -x option to do so.
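
For example, with the Modules setup described below, adding something like

   mpirun -x PATH -x LD_LIBRARY_PATH hostname

to the job script exports both variables along with the job (adjust the
variable list to whatever your module actually sets).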


On Oct 4, 2010, at 5:08 AM, Chris Jewell wrote:

> Hi all,
> 
> Firstly, hello to the mailing list for the first time!  Secondly, sorry for 
> the non-descript subject line, but I couldn't really think how to be more 
> specific!  
> 
> Anyway, I am currently having a problem getting OpenMPI to work within my 
> installation of SGE 6.2u5.  I compiled OpenMPI 1.4.2 from source, and 
> installed under /usr/local/packages/openmpi-1.4.2.  Software on my system is 
> controlled by the Modules framework which adds the bin and lib directories to 
> PATH and LD_LIBRARY_PATH respectively when a user is connected to an 
> execution node.  I configured a parallel environment in which OpenMPI is to 
> be used: 
> 
> pe_namempi
> slots  16
> user_lists NONE
> xuser_listsNONE
> start_proc_args/bin/true
> stop_proc_args /bin/true
> allocation_rule$round_robin
> control_slaves TRUE
> job_is_first_task  FALSE
> urgency_slots  min
> accounting_summary FALSE
> 
> I then tried a simple job submission script:
> 
> #!/bin/bash
> #
> #$ -S /bin/bash
> . /etc/profile
> module add ompi gcc
> mpirun hostname
> 
> If the parallel environment runs within one execution host (8 slots per 
> host), then all is fine.  However, if scheduled across  several nodes, I get 
> an error:
> 
> execv: No such file or directory
> execv: No such file or directory
> execv: No such file or directory
> --
> A daemon (pid 1629) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
> 
> 
> I'm at a loss on how to start debugging this, and I don't seem to be getting 
> anything useful using the mpirun '-d' and '-v' switches.  SGE logs don't note 
> anything.  Can anyone suggest either what is wrong, or how I might progress 
> with getting more information?
> 
> Many thanks,
> 
> 
> Chris
> 
> 
> 
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-04 Thread Chris Jewell
Hi all,

Firstly, hello to the mailing list for the first time!  Secondly, sorry for the 
non-descript subject line, but I couldn't really think how to be more specific! 
 

Anyway, I am currently having a problem getting OpenMPI to work within my 
installation of SGE 6.2u5.  I compiled OpenMPI 1.4.2 from source, and installed 
under /usr/local/packages/openmpi-1.4.2.  Software on my system is controlled 
by the Modules framework which adds the bin and lib directories to PATH and 
LD_LIBRARY_PATH respectively when a user is connected to an execution node.  I 
configured a parallel environment in which OpenMPI is to be used: 

pe_namempi
slots  16
user_lists NONE
xuser_listsNONE
start_proc_args/bin/true
stop_proc_args /bin/true
allocation_rule$round_robin
control_slaves TRUE
job_is_first_task  FALSE
urgency_slots  min
accounting_summary FALSE

I then tried a simple job submission script:

#!/bin/bash
#
#$ -S /bin/bash
. /etc/profile
module add ompi gcc
mpirun hostname

If the parallel environment runs within one execution host (8 slots per host), 
then all is fine.  However, if scheduled across  several nodes, I get an error:

execv: No such file or directory
execv: No such file or directory
execv: No such file or directory
--
A daemon (pid 1629) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished


I'm at a loss on how to start debugging this, and I don't seem to be getting 
anything useful using the mpirun '-d' and '-v' switches.  SGE logs don't note 
anything.  Can anyone suggest either what is wrong, or how I might progress 
with getting more information?

Many thanks,


Chris



--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778