[OMPI users] MPI program hangs when run on more than one host

2010-04-20 Thread long thai
Hi all.

I've been using Open MPI for only a few days. I'm trying to run a simple MPI program,
ProcessColors, which I got from CI-Tutor.
I have 2 hosts; if I run the program separately on each one, it runs well.
However, the program hangs if I run it on the two hosts with the following command:
mpirun --host host1,host2 --preload-binary -np 8 ProcessColors

When I use ps -A to check the running processes, I find 4 processes running on
each host, so I suspect there is a deadlock in my program -- but then why does it
run well on a single host?

All of the following commands run without any problem on both machines:

   - mpirun -np 8 ProcessColors
   - mpirun --host host1 -np 8 ProcessColors
   - mpirun --host host2 -np 8 ProcessColors

Later, I found that the problem appears when a remote host tries to send a
message to the host on which the root process (rank 0) is running, which is the
host from which I launch the command. I don't know why the process blocks in the
send.

Any help would be much appreciated.

Regards.

Long Thai.


Re: [OMPI users] Error on sending argv

2010-04-20 Thread jody
Hi
You should remove the "&" from the first parameter of your MPI_Send
and MPI_Recv calls:

MPI_Send(text, strlen(text) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

MPI_Recv(buffer, 128, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
MPI_COMM_WORLD, &status);

In C/C++ an array name already decays to a pointer to its first element, and
text and buffer are already char* pointers, so passing &text and &buffer hands
MPI the address of the pointer variable rather than the address of the data.
(The hard-coded string most likely only appeared to work because both ranks run
the same binary, so the literal "ABCDEF" sits at the same address in each
process and the received pointer bytes still pointed at valid text; argv[1]
lives at an address that is only valid in the sending process.)
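
For reference, here is a minimal corrected version as a complete program (a
sketch only -- the 128-byte buffer and the argv fallback are choices made for
this example, not taken from the original post):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank;
    char buffer[128];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        const char *text = (argc > 1) ? argv[1] : "ABCDEF";
        /* pass the data pointer itself, not the address of the pointer variable */
        MPI_Send((void *) text, (int) strlen(text) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive into the characters of buffer, not into &buffer */
        MPI_Recv(buffer, 128, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("rank %d received %s\n", rank, buffer);
    }

    MPI_Finalize();
    return 0;
}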

Jody


On Mon, Apr 19, 2010 at 6:31 PM, Andrew Wiles  wrote:
> Hi all Open MPI users,
>
> I wrote a simple MPI program to send a text message to another process. The
> code is below.
>
> (test.c)
>
> #include "mpi.h"
>
> #include 
>
> #include 
>
> #include 
>
>
>
> int main(int argc, char* argv[]) {
>
>     int dest, noProcesses, processId;
>
>     MPI_Status status;
>
>
>
>     char* buffer;
>
>
>
>     char* text = "ABCDEF";
>
>
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &processId);
>
>
>
>     buffer = (char*) malloc(256 * sizeof(char));
>
>
>
>     if (processId == 0) {
>
>       fprintf(stdout, "Master: sending %s to %d\n", text, 1);
>
>       MPI_Send((void *)&text, strlen(text) + 1, MPI_CHAR, 1, 0,
> MPI_COMM_WORLD);
>
>     } else {
>
>       MPI_Recv(&buffer, 128, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
> MPI_COMM_WORLD, &status);
>
>       fprintf(stdout, "Slave: received %s from %d\n", buffer,
> status.MPI_SOURCE);
>
>     }
>
>     MPI_Finalize();
>
>     return 0;
>
> }
>
> After compiling and executing it I get the following output:
>
> [root@cluster Desktop]# mpicc -o test test.c
> [root@cluster Desktop]# mpirun -np 2 test
> Master: sending ABCDEF to 1
> Slave: received ABCDEF from 0
>
> In the source code above, I replace
>
> char* text = "ABCDEF";
>
> by
>
> char* text = argv[1];
>
> then compile and execute it again with the following commands:
>
> [root@cluster Desktop]# mpicc -o test test.c
> [root@cluster Desktop]# mpirun -np 2 test ABCDEF
>
> Then I get the following output:
>
> Master: sending ABCDEF to 1
> [cluster:03917] *** Process received signal ***
> [cluster:03917] Signal: Segmentation fault (11)
> [cluster:03917] Signal code: Address not mapped (1)
> [cluster:03917] Failing at address: 0xbfa445a2
> [cluster:03917] [ 0] [0x959440]
> [cluster:03917] [ 1] /lib/libc.so.6(_IO_fprintf+0x22) [0x76be02]
> [cluster:03917] [ 2] test(main+0x143) [0x80488b7]
> [cluster:03917] [ 3] /lib/libc.so.6(__libc_start_main+0xdc) [0x73be8c]
> [cluster:03917] [ 4] test [0x80486c1]
> [cluster:03917] *** End of error message ***
>
> --
>
> mpirun noticed that process rank 1 with PID 3917 on node cluster.hpc.org
> exited on signal 11 (Segmentation fault).
>
> --
>
> I'm very confused because the only difference between the two programs is the
> difference between
>
> char* text = "ABCDEF";
>
> and
>
> char* text = argv[1];
>
> Can anyone explain why the results are so different? How can I send argv[i]
> to another process?
>
> Thank you very much!
>



Re: [OMPI users] Incorrect results with MPI-IO with OpenMPI v1.3.0 and beyond

2010-04-20 Thread E.T.A.vanderWeide
Hi Scott,

I see the same behavior with the test program I posted a couple of days ago. It
works fine with OpenMPI v1.2, but produces incorrect results with v1.3 and v1.4. I
also agree with your suggestion that something is wrong with the offsets: in my
test program, processors 0 and 1 both read the same data, while processor 1 should
read the data stored after the data read by processor 0.

Regards,

Edwin van der Weide


-Original Message-
From: users-boun...@open-mpi.org on behalf of Samuel Collis
Sent: Mon 4/19/2010 10:06 PM
To: us...@open-mpi.org
Subject: [OMPI users] Incorrect results with MPI-IO with OpenMPI v1.3.0 and beyond

Hi all, 

Around a year ago, I posted the attached note regarding apparently incorrect file
output when using OpenMPI >= 1.3.0.  I was asked to generate a small, self-contained
bit of code that demonstrates the issue.  I have attached that code to this posting
(mpiio.cpp).
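
The mpiio.cpp attachment is not included in this archived digest. Purely as an
illustration of the kind of per-rank, offset-based MPI-IO write being described
here -- not the actual attached code -- a minimal test of this shape might look
like the sketch below; the element counts and defaults are arbitrary:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size, i;
    int total = (argc > 1) ? atoi(argv[1]) : 24;   /* total doubles to write (arbitrary) */
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nloc = total / size;                       /* assume it divides evenly */
    double *buf = malloc(nloc * sizeof(double));
    for (i = 0; i < nloc; i++)
        buf[i] = rank * nloc + i;                  /* globally 0, 1, 2, ..., total-1 */

    /* each rank writes its own contiguous block at its own byte offset */
    MPI_Offset off = (MPI_Offset) rank * nloc * sizeof(double);
    MPI_File_open(MPI_COMM_WORLD, "mpi.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, off, buf, nloc, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}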

You can build this with: 

  mpicxx mpiio.cpp -o mpiio 

And I execute with the command: 

sh-3.2$ mpiexec -n 1 ~/dgm/src/mpiio; od -e mpi.out 
000   0.000e+00   1.000e+00 
020   2.000e+00   3.000e+00 
040   4.000e+00   5.000e+00 
060   6.000e+00   7.000e+00 
100   8.000e+00   9.000e+00 
120   1.000e+01   1.100e+01 
140   1.200e+01   1.300e+01 
160   1.400e+01   1.500e+01 
200   1.600e+01   1.700e+01 
220   1.800e+01   1.900e+01 
240   2.000e+01   2.100e+01 
260   2.200e+01   2.300e+01 
300 

sh-3.2$ mpiexec -n 2 ~/dgm/src/mpiio; od -e mpi.out 
000   1.200e+01   1.300e+01 
020   1.400e+01   1.500e+01 
040   1.600e+01   1.700e+01 
060   1.800e+01   1.900e+01 
100   2.000e+01   2.100e+01 
120   2.200e+01   2.300e+01 
140   1.200e+01   1.300e+01 
160   1.400e+01   1.500e+01 
200   1.600e+01   1.700e+01 
220   1.800e+01   1.900e+01 
240   2.000e+01   2.100e+01 
260   2.200e+01   2.300e+01 
300 

Note that the program should write out doubles 0-23, and on one processor this is
true.  However, for n=2, it incorrectly writes the second rank's data over the
first rank's data.

For larger problems it sometimes also drops information -- i.e., one rank doesn't
write its data at all.  I suspect that the problems are closely related.  To see
this behavior, use 100 elements (instead of the default 2):

mpiexec -n 4 ~/dgm/src/mpiio 100; ls -l mpi.out 
-rw-r- 1 user user 2400 Apr 19 12:19 mpi.out 

mpiexec -n 1 ~/dgm/src/mpiio 100; ls -l mpi.out 
-rw-r- 1 user user 9600 Apr 19 12:19 mpi.out 

Note how the -n 4 file is too small. 

Note that with OpenMPI 1.2.7, I have verified that we get the correct 
results: 

$ mpiexec -n 1 mpiio; od -e mpi.out 
000   0.000e+00   1.000e+00
020   2.000e+00   3.000e+00
040   4.000e+00   5.000e+00
060   6.000e+00   7.000e+00
100   8.000e+00   9.000e+00
120   1.000e+01   1.100e+01
140   1.200e+01   1.300e+01
160   1.400e+01   1.500e+01
200   1.600e+01   1.700e+01
220   1.800e+01   1.900e+01
240   2.000e+01   2.100e+01
260   2.200e+01   2.300e+01
300 

$ mpiexec -n 2 mpiio; od -e mpi.out 
000   0.000e+00   1.000e+00
020   2.000e+00   3.000e+00
040   4.000e+00   5.000e+00
060   6.000e+00   7.000e+00
100   8.000e+00   9.000e+00
120   1.000e+01   1.100e+01
140   1.200e+01   1.300e+01
160   1.400e+01   1.500e+01
200   1.600e+01   1.700e+01
220   1.800e+01   1.900e+01
240   2.000e+01   2.100e+01
260   2.200e+01   2.300e+01
300 

Finally, just to prove that it is OpenMPI-related, I built the latest MPICH2, with
these results:

$ ~/local/mpich2/bin/mpiexec -n 1 mpiio-mpich2; od -e mpi.out 
000 0.0

Re: [OMPI users] MPI program hangs when run on more than one host

2010-04-20 Thread Changsheng Jiang
I have encountered the same problem.

Attaching gdb shows that the processes are spinning in a loop of (e)poll. After
configuring the network interface in ~/.openmpi/mca-params.conf with
btl_tcp_if_include, all hosts work fine.
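
For example, a single line like this in that file (eth0 is only a placeholder for
whichever interface actually connects the hosts):

# ~/.openmpi/mca-params.conf
btl_tcp_if_include = eth0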

just fyi.
 Changsheng Jiang




Re: [OMPI users] MPI program hangs when run on more than one host

2010-04-20 Thread long thai
Hi Changsheng

Thank you very much for your solution. The program runs well now :)

Regards.



Re: [OMPI users] unresolved symbol mca_base_param_reg_int

2010-04-20 Thread Jeff Squyres
Gah!  I didn't look at your error message closely enough the first time -- 
sorry!

Did you perchance upgrade an existing Open MPI installation in place?  I.e., did
you have Open MPI 1.2.7 installed in /somewhere and then install Open MPI
1.3.x/1.4.x into the same /somewhere?

If so, try a full uninstall of Open MPI 1.2.7 from /somewhere first -- or 
install Open MPI 1.4.x into /somewhere_else.

The reason is that Open MPI has a set of plugins that are not necessarily 
compatible between versions, and are not necessarily removed if you just 
install a new version over an old version.
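
For example, something along these lines (the paths are placeholders; the important
part is that the old $prefix/lib/openmpi plugin directory is emptied before the new
version goes in):

rm -rf /somewhere/*          # wipe the 1.2.7 tree, including its lib/openmpi/ plugins
cd openmpi-1.4.1
./configure --prefix=/somewhere
make all
make install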



On Apr 19, 2010, at 6:52 PM, Nev wrote:

> Hi Jeff,
> I have tried --disable-visibility but get the same results. Any other
> ideas? I am not able to remove the dlopen, but I may be able to change it to
> dlopen the MPI library directly, instead of my library that is linked against
> MPI. Is this likely to help?
> Nev
> 
> On Mon, 2010-04-19 at 09:21 -0400, Jeff Squyres wrote:
> > It could well be because of the additional dlopen in your application (we 
> > changed some things from the 1.2 series with regards to this kind of stuff).
> >
> > Try configuring Open MPI with the --disable-visibility switch and see if 
> > that helps.
> >
> >
> > On Apr 17, 2010, at 9:05 PM, Nev wrote:
> >
> > > Hi,
> > > I am having a problem running an application with Open MPI version 1.4.1.
> > > The system works with version 1.2.7, but fails with versions 1.3.4 and
> > > 1.4.1. (These are the only versions I have tried.)
> > >
> > > My application is linked against a shared library which does a dlopen of
> > > a 2nd shared "C" library which is compiled and linked using mpicc. The
> > > application and first shared library are C++.
> > > I rebuild and relink the 2nd shared library each time I change the
> > > openmpi build.
> > >
> > > When MPI_Init is called I get the following error:
> > > symbol lookup error: /opt/openmpi/lib/openmpi/mca_paffinity_linux.so:
> > > undefined symbol: mca_base_param_reg_int
> > >
> > > This does NOT occur with OpenMpi version 1.2.7, Or if I build OpenMpi as
> > > a static library, and then link against this static library.
> > >
> > > I am building a default openmpi except for --prefix=/opt/openmpi and
> > > --enable-static --disable-shared for static library build.
> > >
> > > I would like to be able to use a non-static Open MPI build.
> > >
> > > Any suggestion on what I am doing wrong?
> > >
> > > Thanks Nev
> > >
> > >
> > >
> > >


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] 'readv failed: Connection timed out' issue

2010-04-20 Thread Jonathan Dursi
Hi:

We've got OpenMPI 1.4.1 and Intel MPI running on our 3000-node system.  We like
OpenMPI for large jobs, because the startup time is much faster (and startup is
more reliable) than with the current IntelMPI defaults; but we're having some
pretty serious problems when the jobs are actually running.  When running medium-
to large-sized jobs (say, anything over 500 cores) over ethernet using OpenMPI,
several of our users, running a variety of very different sorts of codes, report
errors like this:

[gpc-f102n010][[30331,1],212][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

which sometimes hang the job, or sometimes kill it outright:

[gpc-f114n073][[23186,1],109][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]  
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
[gpc-f114n075][[23186,1],125][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]  
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mpirun: killing job...

--
mpirun noticed that process rank 0 with PID 9513 on node gpc-f123n025  
exited on signal 0 (Unknown signal 0).
--

We don't see this problem when the same users, using the same codes, use 
IntelMPI.  

Unfortunately, this only happens intermittently, and only with large jobs, so it
is hard to track down.  It seems to happen more reliably with larger numbers of
processors, but I don't know whether that tells us something real about the issue
or just that larger N -> better statistics.  For one user's code, it definitely
occurs during an MPI_Wait (this particular code has been run on a wide variety of
machines with a wide variety of MPIs -- which isn't proof of correctness, of
course, but everything looks fine); for others it is less clear.  I don't know if
it's an OpenMPI issue, or just represents a network issue that Intel's MPI happens
to be more tolerant of with its default set of parameters.  It's also unclear
whether or not this issue occurred with earlier OpenMPI versions.

Where should I start looking to find out what is going on?   Are there 
parameters that can be adjusted to play with timeouts to see if the issue can 
be localized, or worked around?

- Jonathan
-- 
Jonathan Dursi 







Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-04-20 Thread Terry Dontje

Hi Jonathan,

Do you know what the top-level function or communication pattern is?  Is it some
type of collective, or a pattern with a many-to-one component?  What might be
happening is that, since OMPI uses lazy connections by default, if all processes
try to establish connections to the same process at once you might run into the
error below.

You might want to see if setting "--mca mpi_preconnect_all 1" helps.  But beware
that this will increase your startup time.  It might, however, give us insight as
to whether the problem is a single rank being flooded with connection requests.


--td




--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 



Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-04-20 Thread Jonathan Dursi
On 2010-04-20, at 9:18AM, Terry Dontje wrote:

> Hi Jonathan,
> 
> Do you know what the top level function is or communication pattern?  Is it 
> some type of collective or a pattern that has a many to one. 

Ah, I should have mentioned: the best-characterized code that we're seeing this
with is an absolutely standard, (logically) regular-grid hydrodynamics code that
only does nearest-neighbour communication for exchanging guardcells; the Wait in
this case is, I think, just a matter of overlapping communication with computation
of the inner zones.  There are things like allreduces in there as well, for
setting timesteps, but the communication pattern is overall extremely regular and
well-behaved.

> What might be happening is that since OMPI uses a lazy connections by default 
> if all processes are trying to establish communications to the same process 
> you might run into the below.
> 
> You might want to see if setting "--mca mpi_preconnect_all 1" helps any.  But 
> beware this will cause your startup to increase.  However, this might give us 
> insight as to whether the problem is flooding a single rank with connect 
> requests.

I'm certainly willing to try it.

- Jonathan

-- 
Jonathan Dursi 







Re: [OMPI users] openMPI configure/Installing problem on Mac with gnu-gcc-4.4.3 / gnu-gcc-4.5

2010-04-20 Thread Jeff Squyres
On Apr 19, 2010, at 12:08 PM, Baowei Liu wrote:

> Sorry I didn't give you such details in my first email. I tried what you 
> said--the tarball attached to this email includes the configure and make 
> output information:
> 
> ./configure --prefix=/usr/local/openmpi | tee config.out
> sudo make all > make.out
> 
> The same error appeared:
> 
> libtool: compile:  gcc -DHAVE_CONFIG_H -I. 
> -I../../../../../ompi/mca/coll/hierarch -I../../../../opal/include 
> -I../../../../orte/include -I../../../../ompi/include 
> -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../../.. 
> -I../../../.. -I../../../../../opal/include -I../../../../../orte/include 
> -I../../../../../ompi/include -D_REENTRANT -O3 -DNDEBUG -finline-functions 
> -fno-strict-aliasing -fvisibility=hidden -MT coll_hierarch.lo -MD -MP -MF 
> .deps/coll_hierarch.Tpo -c 
> ../../../../../ompi/mca/coll/hierarch/coll_hierarch.c  -fno-common -DPIC -o 
> .libs/coll_hierarch.o
> as: more than one -arch option (not allowed, use cc(1) instead)
> make[2]: *** [coll_hierarch.lo] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
> 
> As I said, I tried to get rid of this error by setting "-arch" option when 
> configure, like:

Ah, ok, now I understand what you tried; thanks.

The above error message is a little puzzling because Open MPI is not providing 
any -arch flags on the compile command line.  But it's the assembler that is 
complaining (as).  Weird.

The source file where the problem is occurring isn't particularly special
(ompi/mca/coll/hierarch/coll_hierarch.c).  I can't imagine why it would cause
this issue. :-\

Try copy-n-pasting the "gcc ... .libs/coll_hierarch.o" command line to a shell 
and running it in the ompi/mca/coll/hierarch directory and see if you can get 
it to run.  Try snipping out the -O3 and see if that helps. Try removing 
-fvisibility, etc.  See if you can get it to go by selectively removing command 
line flags.

Other than that, I'm out of ideas.  It sounds like it could be either a 
compiler bug, or some kind of bad interaction between your different compiler / 
assembler versions on your system...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] unresolved symbol mca_base_param_reg_int

2010-04-20 Thread Nev
Hi Jeff,
I did the install to the "same place". I always use /opt/openmpi; the
procedure I use when building is:
configure --prefix=/opt/openmpi ...
rm -r /opt/openmpi/*
make clean
make all
make install
Is this sufficient to uninstall the previous version, or is more required?



Re: [OMPI users] OS X - Can't find the absoft directory

2010-04-20 Thread Paul Cizmas
Is it possible to have two Open MPI installations on the same computer?  I have
openmpi 1.3.2 working fine with gfortran, but I cannot build openmpi 1.4.1 with
Absoft - I get this message from libtool:

/bin/sh ../../../libtool --mode=compile /Applications/Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -c -o mpi.lo mpi.f90
libtool: compile:  /Applications/Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -c mpi.f90 -o .libs/mpi.o

Can't find the absoft directory.
Please set the ABSOFT environment variable and try again.
make[4]: *** [mpi.lo] Error 1

Note that ABSOFT is properly set, as in fact shown above on the first line.  In
addition, the absolute path of f90 (/Applications/Absoft11.0/bin/f90) is correct.

To recreate the problem I went to the folder openmpi-1.4.1/ompi/mpi/f90, checked
the ABSOFT variable again, and called libtool.  The result is obviously the same:

sudo /bin/sh ../../../libtool --mode=compile /Applications/Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -c -o mpi.lo mpi.f90

Password:
libtool: compile:  /Applications/Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -c mpi.f90 -o .libs/mpi.o

Can't find the absoft directory.
Please set the ABSOFT environment variable and try again.

I am inclined to say that if this is not the existing openmpi 1.3.2 / gfortran
installation interfering with the 1.4.1 / Absoft build, then it might be a bug in
openmpi.


Paul



On Apr 19, 2010, at 11:20 AM, Jeff Squyres wrote:


On Apr 19, 2010, at 12:11 PM, Paul Cizmas wrote:


Here there was a difference - it did work for both cases:

~$ ABSOFT=foo
~$ testme
ABSOFT=foo
~$ export ABSOFT=foo
~$ testme
ABSOFT=foo
~$


This could well be because you had previously exported ABSOFT...?  (I forget the
exact semantics offhand.)

I'm somewhat at a loss to explain the behavior you're seeing, then.  In this
regard, OMPI is a pretty standard configure/make open source project -- we're not
frobbing the environment before calling the underlying libtool script (this stuff
is totally handled by Automake, actually).

Some off-the-wall ideas:

1. Is $ABSOFT set to the correct value?  I.e., could the error message be
interpreted as "The Absoft compiler was unable to find what it expected to find
in $ABSOFT"?

2. Is there anything different/unique about your build environment compared to
the environment you just ran those tests in?

3. You might want to try editing the "libtool" script that is emitted after
running OMPI's configure and add some debugging to see if $ABSOFT really is set
when that script is launched.
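
For example, a single line added near the top of the generated ./libtool script
would do (a sketch):

# temporary debugging: show what the libtool script actually sees
echo "libtool sees ABSOFT='$ABSOFT'" 1>&2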


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI users] openMPI configure/Installing problem on Mac with gnu-gcc-4.4.3 / gnu-gcc-4.5

2010-04-20 Thread Baowei Liu
Thanks a lot, Jeff. I'll try what you told me and let you know the result.



Re: [OMPI users] openMPI configure/Installing problem on Mac with gnu-gcc-4.4.3 / gnu-gcc-4.5

2010-04-20 Thread Jeff Squyres
On Apr 20, 2010, at 7:25 PM, Baowei Liu wrote:

> Thanks a lot, Jeff. I'll try what you told me and let you know the result.

Someone else pointed out to me off-list that you're using "sudo" for make all 
-- do you need to?  Most people build as an unprivileged user and then only use 
"sudo" for "make install".  I don't know if sudo's environment would be mucking 
you up so deep in the build, but it's worth trying without it...?

Additionally, they pointed out that if you pass "-v" to CFLAGS (or just add it 
to the cut-n-paste of this particular command line), you should see all the 
commands that gcc is invoking under the covers.  That might be useful to see 
what's going on in this specific command line.
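
For example (the command is heavily abbreviated here -- paste the full failing
line from your make output and just append -v):

gcc -v -DHAVE_CONFIG_H ... -c ../../../../../ompi/mca/coll/hierarch/coll_hierarch.c -fno-common -DPIC -o .libs/coll_hierarch.o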

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] unresolved symbol mca_base_param_reg_int

2010-04-20 Thread Jeff Squyres
On Apr 20, 2010, at 6:16 PM, Nev wrote:

> Hi Jeff,
> I did the install to the "same place". I always use /opt/openmpi; the
> procedure I use when building is:
> configure --prefix=/opt/openmpi ...
> rm -r /opt/openmpi/*
> make clean
> make all
> make install
> Is this sufficient to uninstall the previous version, or is more required?

Yes, that should be sufficient.  Is that what you did this time?  

If so, is there any way you can provide a small code example of the problem 
you're seeing?
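
For context, a minimal skeleton of the kind of setup described at the top of this
thread might look like the following; all file and symbol names here are
hypothetical, and this is an illustration rather than Nev's actual code:

/* host application (or the first shared library): loads an mpicc-built plugin at runtime */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* libmpiplugin.so is built and linked with mpicc and calls MPI_Init() internally.
       RTLD_GLOBAL is sometimes suggested for plugin symbol-visibility problems like
       this one, but that is an assumption here, not a confirmed fix. */
    void *handle = dlopen("./libmpiplugin.so", RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    void (*entry)(void) = (void (*)(void)) dlsym(handle, "run_mpi_part");
    if (entry)
        entry();
    dlclose(handle);
    return 0;
}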

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] openMPI configure/Installing problem on Mac with gnu-gcc-4.4.3 / gnu-gcc-4.5

2010-04-20 Thread Baowei Liu
Thank you so much, Jeff.  It works!!!

I followed the instructions in the INSTALL file and made a new directory ./build,
but when I configured in this new directory I got a "permission denied" error.
That's why I used "sudo".

Now I configure directly under the openmpi-X.Y.Z directory:

./configure --prefix=.../openmpi
make all
sudo make install

Tested with mpi_helloworld.f90; it works just fine.

Thanks again for your time and help!
