I'm afraid that having 2 cores on a single machine will always outperform 
having 1 core on each of two machines if any communication is involved.

The most likely explanation is that OMPI is polling while waiting for 
messages to arrive. You might look more closely at your code and try to 
optimize it so that the number-crunching can get more attention.
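
One quick experiment you could try (just a sketch - check the exact parameter 
name on your install with "ompi_info --param mpi all"): tell OMPI to yield the 
processor when it is idle instead of spinning, e.g.

   mpirun --mca mpi_yield_when_idle 1 -hostfile hostfile -np 16 ./your_solver

(./your_solver standing in for your FEA binary). That trades latency for CPU 
time, so whether it actually speeds anything up is something only a test will 
tell you - but it should give a rough idea of how much of that usage is just 
the progress engine spinning.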

Others on this list are far more knowledgeable than I am about doing such 
things, so I'll let them take it from here. Glad it is now running!


On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:

> OpenMPI,
> 
> Following up: the sysadmin opened ports for machine-to-machine communication 
> and OpenMPI is running successfully with no errors in connectivity_c, 
> hello_c, or ring_c. Since then, I have started running our MPP software 
> (finite element analysis), and I notice that a simple job with 1 core on 
> machine1 and 1 core on machine2 is considerably slower than a 2-core job on 
> a single machine. 
> 
> A quick look at top shows kernel (system) CPU usage at almost twice the user 
> CPU usage! On a 16-core test job (8 cores per node, so 2 nodes total), OpenMPI 
> was consuming ~65% of the CPU on kernel-related work rather than on 
> number-crunching... Granted, we are running on GigE, but this is a finite 
> element code with no heavy data transfer in it. 
> I'm looking into benchmarking tools, but my sysadmin is not very open to 
> installing third-party software. Do you have any suggestions for "big name" 
> or guaranteed-safe tools I could use to figure out what's causing the holdup 
> with all the kernel usage? I'm pretty sure it's network traffic, but (as far 
> as I know, not being a Linux whiz) I have no way of telling with the 
> standard tools in RHEL.
> 
> Thanks for all the help! I'm glad to finally have it working, and I think 
> with a little tweaking it should be ready to go very soon.
> 
> Regards,
> Robert Walters
> --- On Sat, 7/10/10, Ralph Castain <r...@open-mpi.org> wrote:
> 
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Saturday, July 10, 2010, 4:37 PM
> 
> The "static ports" flag means something different - it is used when the 
> daemon is given a fixed port to use. In some installations, we lock every 
> daemon to the same port number so that each daemon can compute exactly how to 
> contact its peers (i.e., no contact info exchange required for wireup).
> 
> You have a "fixed range" scenario, not a "static port" one - hence the message.
> 
> Let us know how it goes - I agree it sounds like something to discuss with 
> the sysadmin.
> 
> 
> On Jul 10, 2010, at 1:47 PM, Robert Walters wrote:
> 
>> I ran oob_tcp_verbose 99 and I am getting something interesting I never got 
>> before.
>> 
>> [machine 2:22347] bind() failed: no port available in the range [60001-60016]
>> [machine 2:22347] mca_oob_tcp_init: unable to create IPv4 listen socket: 
>> Error
>> 
>> I never got that error before we changed the iptables, but now I do... Very 
>> interesting. I will have to talk to my sysadmin again and make sure he 
>> opened the right ports on my two test machines; it looks as though there are 
>> no open ports. Another interesting thing is that I see the daemon is still 
>> reporting:
>> 
>> Daemon [[28845,0],1] checking in as pid 22347 on host machine 2
>> Daemon [[28845,0],1] not using static ports
>> 
>> Unless I am misunderstanding, that should have been taken care of when I 
>> specified which ports to use - I am telling it a static set of ports... 
>> Anyhow, I will check with my sysadmin again and see what he says. At least 
>> OpenMPI is correctly interpreting the range. 
>> 
>> Thanks for the help.
>> 
>> --- On Sat, 7/10/10, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Date: Saturday, July 10, 2010, 3:21 PM
>> 
>> Are there multiple interfaces on your nodes? I'm wondering if we are using a 
>> different network than the one where you opened these ports.
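>> 
>> If there is more than one NIC, you can also pin both the oob and the tcp btl 
>> to a specific interface, e.g. (with eth0 standing in for whichever interface 
>> has the opened ports):
>> 
>>    mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ...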
>> 
>> You'll get quite a bit of output, but you can turn on debug output in the 
>> oob itself with -mca oob_tcp_verbose xx. The higher the number, the more you 
>> get.
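>> 
>> For example, with your earlier test case:
>> 
>>    mpirun --mca oob_tcp_verbose 99 -hostfile hostfile -np 16 hello_c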
>> 
>> 
>> On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:
>> 
>>> Hello again,
>>> 
>>> I believe my administrator has opened the ports I requested. The problem I 
>>> am having now is that OpenMPI is not honoring the port assignments I defined 
>>> in openmpi-mca-params.conf (those files look to have permission 644 - should 
>>> it be 755?).
>>> 
>>> When I run netstat -ltnup I see 14 orted processes listening on TCP, but the 
>>> ports are scattered around the 26000 range even though I specified 
>>> 60001-60016 in the mca-params file. Is there a parameter I am missing? In 
>>> any case, I am still hanging as originally described, even with the ports 
>>> opened and the settings in mca-params in place. 
>>> 
>>> Any other ideas on what might be causing the hang? Is there a more verbose 
>>> mode I can employ to see more deeply into the issue? I have run 
>>> --debug-daemons and --mca plm_base_verbose 99.
>>> 
>>> Thanks!
>>> --- On Tue, 7/6/10, Robert Walters <raw19...@yahoo.com> wrote:
>>> 
>>> From: Robert Walters <raw19...@yahoo.com>
>>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>>> To: "Open MPI Users" <us...@open-mpi.org>
>>> Date: Tuesday, July 6, 2010, 5:41 PM
>>> 
>>> Thanks for your expeditious responses, Ralph.
>>> 
>>> Just to confirm with you, I should change openmpi-mca-params.conf to 
>>> include:
>>> 
>>> oob_tcp_port_min_v4 = (My minimum port in the range)
>>> oob_tcp_port_range_v4 = (My port range)
>>> btl_tcp_port_min_v4 = (My minimum port in the range)
>>> btl_tcp_port_range_v4 = (My port range)
>>> 
>>> correct?
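>>> 
>>> For instance, if I'm given the block 60001-60016 (16 ports) for the oob, I 
>>> take it the range value is a count rather than an upper bound, so it would 
>>> read:
>>> 
>>>    oob_tcp_port_min_v4 = 60001
>>>    oob_tcp_port_range_v4 = 16
>>> 
>>> and likewise for the btl_tcp pair, with whatever block is set aside for that.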
>>> 
>>> Also, for a cluster of around 32-64 processes (8 processors per node), how 
>>> wide a range will I require? I've noticed some entries on the mailing 
>>> list suggesting you only need a few to get started and that more are opened 
>>> as necessary. Will I be safe with 20, or should I go for 100? 
>>> 
>>> Thanks again for all of your help!
>>> 
>>> --- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> From: Ralph Castain <r...@open-mpi.org>
>>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>>> To: "Open MPI Users" <us...@open-mpi.org>
>>> Date: Tuesday, July 6, 2010, 5:31 PM
>>> 
>>> Problem isn't with ssh - the problem is that the daemons need to open a TCP 
>>> connection back to the machine where mpirun is running. If the firewall 
>>> blocks that connection, then we can't run.
>>> 
>>> If you can get a range of ports opened, then you can specify the ports OMPI 
>>> should use for this purpose. If the sysadmin won't allow even that, then 
>>> you are pretty well hosed.
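>>> 
>>> Roughly speaking, the sysadmin would open a small TCP range on every node - 
>>> with plain iptables that is something along the lines of (the exact chain 
>>> depends on his setup, and the port numbers here are only an example):
>>> 
>>>    iptables -I INPUT -p tcp --dport 60001:60016 -j ACCEPT
>>> 
>>> and you would then point OMPI at the same range, either with --mca flags on 
>>> the mpirun command line or in openmpi-mca-params.conf (ompi_info --param oob 
>>> tcp and ompi_info --param btl tcp will show the exact parameter names).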
>>> 
>>> 
>>> On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
>>> 
>>>> Yes, there is a system firewall, and I don't think the sysadmin will allow 
>>>> it to be disabled. Each Linux machine has the built-in RHEL firewall. SSH 
>>>> is enabled through the firewall, though.
>>>> 
>>>> --- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>> From: Ralph Castain <r...@open-mpi.org>
>>>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>>>> To: "Open MPI Users" <us...@open-mpi.org>
>>>> Date: Tuesday, July 6, 2010, 4:19 PM
>>>> 
>>>> It looks like the remote daemon is starting - is there a firewall in the 
>>>> way?
>>>> 
>>>> On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
>>>> 
>>>>> Hello all,
>>>>> 
>>>>> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons, and 
>>>>> right now I am just working on getting OpenMPI itself up and running. I 
>>>>> have a successful configure and make all install. The LD_LIBRARY_PATH and 
>>>>> PATH variables were correctly edited. mpirun -np 8 hello_c works 
>>>>> successfully on all machines. I have set up my two test machines with DSA 
>>>>> key pairs that work with each other.
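>>>>> 
>>>>> (For reference, my shell startup on every node has something along these 
>>>>> lines - the prefix below is just a placeholder for wherever --prefix 
>>>>> actually points on our systems:
>>>>> 
>>>>>    export PATH=/opt/openmpi-1.4.2/bin:$PATH
>>>>>    export LD_LIBRARY_PATH=/opt/openmpi-1.4.2/lib:$LD_LIBRARY_PATH )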
>>>>> 
>>>>> The problem comes when I use my hostfile to try to communicate across 
>>>>> machines. The hostfile is set up correctly as <host_name> <slots> 
>>>>> <max-slots>. When running with all verbose options enabled, "mpirun --mca 
>>>>> plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca 
>>>>> oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 
>>>>> hello_c", I receive the following text output.
>>>>> 
>>>>> [machine1:03578] mca: base: components_open: Looking for plm components
>>>>> [machine1:03578] mca: base: components_open: opening plm components
>>>>> [machine1:03578] mca: base: components_open: found loaded component rsh
>>>>> [machine1:03578] mca: base: components_open: component rsh has no 
>>>>> register function
>>>>> [machine1:03578] mca: base: components_open: component rsh open function 
>>>>> successful
>>>>> [machine1:03578] mca: base: components_open: found loaded component slurm
>>>>> [machine1:03578] mca: base: components_open: component slurm has no 
>>>>> register function
>>>>> [machine1:03578] mca: base: components_open: component slurm open 
>>>>> function successful
>>>>> [machine1:03578] mca:base:select: Auto-selecting plm components
>>>>> [machine1:03578] mca:base:select:(  plm) Querying component [rsh]
>>>>> [machine1:03578] mca:base:select:(  plm) Query of component [rsh] set 
>>>>> priority to 10
>>>>> [machine1:03578] mca:base:select:(  plm) Querying component [slurm]
>>>>> [machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. 
>>>>> Query failed to return a module
>>>>> [machine1:03578] mca:base:select:(  plm) Selected component [rsh]
>>>>> [machine1:03578] mca: base: close: component slurm closed
>>>>> [machine1:03578] mca: base: close: unloading component slurm
>>>>> [machine1:03578] mca: base: components_open: Looking for oob components
>>>>> [machine1:03578] mca: base: components_open: opening oob components
>>>>> [machine1:03578] mca: base: components_open: found loaded component tcp
>>>>> [machine1:03578] mca: base: components_open: component tcp has no 
>>>>> register function
>>>>> [machine1:03578] mca: base: components_open: component tcp open function 
>>>>> successful
>>>>> Daemon was launched on machine2- beginning to initialize
>>>>> [machine2:01962] mca: base: components_open: Looking for oob components
>>>>> [machine2:01962] mca: base: components_open: opening oob components
>>>>> [machine2:01962] mca: base: components_open: found loaded component tcp
>>>>> [machine2:01962] mca: base: components_open: component tcp has no 
>>>>> register function
>>>>> [machine2:01962] mca: base: components_open: component tcp open function 
>>>>> successful
>>>>> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
>>>>> Daemon [[1418,0],1] not using static ports
>>>>> 
>>>>> At this point the system hangs indefinitely. While running top in the 
>>>>> machine2 terminal, I see several things come up briefly: sshd (root), 
>>>>> tcsh (myuser), orted (myuser), and mcstransd (root). I was wondering 
>>>>> whether sshd needs to be initiated by myuser; it is currently turned off 
>>>>> in sshd_config through UsePAM yes. This was set up by the sysadmin, but 
>>>>> it can be worked around if necessary.
>>>>> 
>>>>> So in summary, mpirun works on each machine individually, but hangs when 
>>>>> initiated through a hostfile or with the -host flag. ./configure with 
>>>>> defaults and --prefix. LD_LIBRARY_PATH and PATH set up correctly. Any 
>>>>> help is appreciated. Thanks!
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
