Re: [OMPI users] Multi-threading with OpenMPI ?

2009-09-13 Thread Ashika Umanga Umagiliya
One more note: I do not call MPI_Finalize() from the
"libParallel.so" library.


Ashika Umanga Umagiliya wrote:

Greetings all,

After some reading, I found out that I have to build Open MPI using
"--enable-mpi-threads".
After that, I changed the MPI_Init() code in my "libParallel.so" and in
"parallel-svr" (please refer to http://i27.tinypic.com/mtqurp.jpg ) to:


 int sup;
 MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &sup);
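Spelled out a bit more fully (a sketch, not tested against this setup; the check on the returned support level is the important part, since MPI_Init_thread may grant less than what was requested):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided = MPI_THREAD_SINGLE;

    /* Request full multi-threading support; the library may grant less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* Open MPI must be built with --enable-mpi-threads for
           MPI_THREAD_MULTIPLE to actually be available. */
        fprintf(stderr, "warning: only got thread level %d\n", provided);
    }

    MPI_Finalize();
    return 0;
}
```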

Now when multiple requests come in (multiple threads), MPI gives one of
the following two errors:


"[umanga:06127] [[8004,1],0] ORTE_ERROR_LOG: Data 
unpack would read past end of buffer in file dpm_orte.c at line 
299

[umanga:6127] *** An error occurred in MPI_Comm_spawn
[umanga:6127] *** on communicator MPI_COMM_SELF
[umanga:6127] *** MPI_ERR_UNKNOWN: unknown error
[umanga:6127] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[umanga:06126] [[8004,0],0]-[[8004,1],0] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6127 on
node umanga exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
"

or sometimes:

"[umanga:5477] *** An error occurred in MPI_Comm_spawn
[umanga:5477] *** on communicator MPI_COMM_SELF
[umanga:5477] *** MPI_ERR_UNKNOWN: unknown error
[umanga:5477] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[umanga:05477] [[7630,1],0] ORTE_ERROR_LOG: Data 
unpack would read past end of buffer in file dpm_orte.c at line 
299
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 5477 on
node umanga exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------"




Any tips?

Thank you

Ashika Umanga Umagiliya wrote:

Greetings all,

Please refer to image at:
http://i27.tinypic.com/mtqurp.jpg

Here is the process illustrated in the image:

1) The C++ web service loads "libParallel.so" when it starts up (dlopen).
2) When a new request comes from a client, a *new thread* is created,
the SOAP data is bound to C++ objects, and the calcRisk() method of the
web service is invoked. Inside this method, "calcRisk()" of "libParallel"
is invoked (using dlsym, etc.).
3) Inside "calcRisk()" of "libParallel", it spawns the "parallel-svr" MPI
application. (I am using Boost MPI and Boost serialization to send
custom data types across the spawned processes.)
4) "parallel-svr" (the MPI application in the image) executes the
parallel logic and sends the result back to "libParallel.so" using Boost
MPI send, etc.
5) "libParallel.so" sends the result to the web service, which binds it
into SOAP and sends it to the client, and the thread ends.


My problem is:

Everything works fine for the first request from the client, but for
the second request it throws an error (I assume from "libParallel.so")
saying:


"--------------------------------------------------------------------------
Calling any MPI-function after calling MPI_Finalize is erroneous.
The only exceptions are MPI_Initialized, MPI_Finalized and 
MPI_Get_version.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[umanga:19390] Abort after MPI_FINALIZE completed successfully; not 
able to guarantee that all other processes were killed!"



Is this because of multithreading? Any idea how to fix this?
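For what it's worth, the usual pattern when MPI lives inside a long-running server is to initialize once per process and guard against re-initialization, since MPI_Init and MPI_Finalize may each be called at most once per process. A sketch (`ensure_mpi_initialized` is a hypothetical helper name, not part of any existing API):

```c
#include <mpi.h>

/* Initialize MPI at most once per process. A per-request
   init/finalize cycle (as in a threaded web service) will fail on
   the second request, because MPI_Init and MPI_Finalize may each
   be called only a single time per process. */
static void ensure_mpi_initialized(void) {
    int initialized = 0, finalized = 0;
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    if (finalized) {
        /* Too late: once finalized, this process can never use MPI again. */
        return;
    }
    if (!initialized) {
        int provided;
        MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
    }
}
```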

Thanks in advance,
umanga








Re: [OMPI users] undefined symbol error when built as a sharedlibrary

2009-09-13 Thread Ashika Umanga Umagiliya

Hi Jeff,

Thanks Jeff, that clears everything up.
Now I remember: some time ago I ran into an issue like this when using
dynamic loading (dlopen, dlsym, etc.) and later had to switch to shared
libraries. I think this is the same thing.


thanks again
umanga

Jeff Squyres wrote:

On Sep 10, 2009, at 9:42 PM, Ashika Umanga Umagiliya wrote:


That fixed the problem !
You are indeed a voodoo master... could you explain the spell behind
your magic :)



The problem has to do with how plugins (aka dynamic shared objects, 
DSO's) are loaded.  When a DSO is loaded into a Linux process, it has 
the option of making all the public symbols in that DSO public to the 
rest of the process or private within its own scope.


Let's back up.  Remember that Open MPI is based on plugins (DSO's).  
It loads lots and lots of plugins during execution (mostly during 
MPI_INIT).  These plugins call functions in OMPI's public libraries 
(e.g., they call functions in libmpi.so).  Hence, when the plugin 
DSO's are loaded, they need to be able to resolve these symbols into 
actual code that can be invoked.  If the symbols cannot be resolved, 
the DSO load fails.


If libParallel.so is loaded into a private scope, then its linked 
libraries (e.g., libmpi.so) are also loaded into that same private 
scope.  Hence, all of libmpi.so's public symbols are only public 
within that single, private scope.  Then, when OMPI goes to load its 
own DSOs, since libmpi.so's public symbols are in a private scope, 
OMPI's DSO's can't find them -- and therefore they refuse to load.  
(private scopes are not inherited -- a new DSO load cannot "see" 
libParallel.so/libmpi.so's private scope).


It's an educated guess from your description that this is what was 
happening.


OMPI's --disable-dlopen configure option has Open MPI build in a 
different way.  Instead of building all of OMPI's plugins as DSOs, 
they are "slurped" up into libmpi.so (etc.).  So there's no "loading" 
of DSOs at MPI_INIT time -- the plugin code actually resides *in* 
libmpi.so itself.  Hence, resolution of all symbols is done when 
libParallel.so loads libmpi.so.  Additionally, there's no secondary 
private scope created when DSOs are loaded -- they're all 
self-contained within libmpi.so (etc.).  And therefore all the 
libmpi.so symbols that are required for the plugins are all able to be 
found/resolved at load time.


Does that make sense?




Regards,
umanga


Jeff Squyres wrote:
> I'm guessing that this has to do with deep, dark voodoo involved with
> the run time linker.
>
> Can you try configuring/building Open MPI with --disable-dlopen
> configure option, and rebuilding your libParallel.so against the new
> libmpi.so?
>
> See if that fixes the problem for you.  If it does, I can explain in
> more detail (if you care).
>
>
> On Sep 10, 2009, at 3:24 AM, Ashika Umanga Umagiliya wrote:
>
>> Greetings all,
>>
>> My parallel application is built as a shared library (libParallel.so).
>> (I use Debian Lenny 64-bit).
>>  A webservice is used to dynamically load libParallel.so and inturn
>> execute the parallel process .
>>
>> But during runtime I get the error :
>>
>> webservicestub: symbol lookup error:
>> /usr/local/lib/openmpi/mca_paffinity_linux.so: undefined symbol:
>> mca_base_param_reg_int
>>
>> which I cannot figure out. I followed every 'ldd' and 'nm', and
>> everything seems fine.
>> So I compiled and tested my parallel code as an executable, and then
>> it worked fine.
>>
>> What could be the reason for this?
>>
>> Thanks in advance,
>> umanga
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>









[OMPI users] Fails to run "MPI_Comm_spawn" on remote host

2009-09-13 Thread Jaison Paul

Hi,

I am trying to create a library using Open MPI for an SOA middleware
for my PhD research. "MPI_Comm_spawn" is the one I need to use. I
got a sample example working, but only on the local host. Whenever I
try to run the spawned children on a remote host, the parent cannot
launch them and I get the following error message:


--BEGIN MPIRUN AND ERROR MSG
mpirun --prefix /opt/mpi/ompi-1.3.2/ --mca btl_tcp_if_include eth0 \
    -np 1 /home/jaison/mpi/advanced_MPI/spawn/manager

Manager code started - host headnode -- myid & world_size 0 1
Host is: myhost
WorkDir is: /home/jaison/mpi/advanced_MPI/spawn/lib
 
--------------------------------------------------------------------------

There are no allocated resources for the application
  /home/jaison/mpi/advanced_MPI/spawn//lib
that match the requested mapping:


Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
 
--------------------------------------------------------------------------

--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
 
--------------------------------------------------------------------------

mpirun: clean termination accomplished
--END OF ERROR MSG---


I use the reserved keys - 'host' & 'wdir' - to set the remote host  
and work directory using MPI_Info. Here is the code snippet:


--BEGIN Code Snippet---

  MPI_Info hostinfo;
  MPI_Info_create(&hostinfo);
  MPI_Info_set(hostinfo, "host", "myhost");
  MPI_Info_set(hostinfo, "wdir", "/home/jaison/mpi/advanced_MPI/spawn/lib");

  // Checking 'hostinfo'. The results are okay (see above)
  int test0 = MPI_Info_get(hostinfo, "host", valuelen, value, &flag);
  int test  = MPI_Info_get(hostinfo, "wdir", valuelen, value1, &flag);
  printf("Host is: %s\n", value);
  printf("WorkDir is: %s\n", value1);

  sprintf(launched_program, "launched_program");

  MPI_Comm_spawn(launched_program, MPI_ARGV_NULL, number_to_spawn,
                 hostinfo, 0, MPI_COMM_SELF, &intercomm,
                 MPI_ERRCODES_IGNORE);

--END OF Code Snippet---


I've set the LD_LIBRARY_PATH correctly. Is "MPI_Comm_spawn"
implemented in Open MPI (I am using version 1.3.2)? If so, where am I
going wrong? Any input will be very much appreciated.
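The error text above hints at the cause: Open MPI can only place spawned children on nodes that are already part of mpirun's allocation, so the spawn target generally has to be listed via --host or a hostfile when the manager is launched. A sketch reusing the host names from the snippet above (adjust to the actual cluster):

```shell
# Add the spawn target to the allocation so the "host" info key can map to it
mpirun --prefix /opt/mpi/ompi-1.3.2/ --mca btl_tcp_if_include eth0 \
       --host headnode,myhost \
       -np 1 /home/jaison/mpi/advanced_MPI/spawn/manager
```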


Thanking you in advance.

Jaison
jmule...@cs.anu.edu.au
http://cs.anu.edu.au/~Jaison.Mulerikkal/Home.html






Re: [OMPI users] How to build OMPI with Checkpoint/restart.

2009-09-13 Thread Marcin Stolarek
I've tried another time. Here is what I get when trying to run
using 1.4a1r21964:

(terminus:~) mstol% mpirun --am ft-enable-cr ./a.out
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_cr_init() failed failed
  --> Returned value -1 instead of OPAL_SUCCESS
--
[terminus:06120] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
runtime/orte_
init.c at line 79
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[terminus:6120] Abort before MPI_INIT completed successfully; not able to
guaran
tee that all other processes were killed!
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

I've included config.log and the ompi_info --all output in the attachment.
LD_LIBRARY_PATH is set correctly.
Any idea?
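For reference, in the 1.3/1.4 series checkpoint/restart support had to be enabled at configure time; a sketch of the flags as I understand them from the Open MPI fault-tolerance documentation ($BLCR standing for your BLCR install prefix, so the paths here are assumptions):

```shell
# Enable the checkpoint/restart FT framework and point it at BLCR
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
            --with-blcr=$BLCR --with-blcr-libdir=$BLCR/lib
make && make install
```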

marcin





2009/9/12 Marcin Stolarek 

> Hi,
> I'm trying to compile Open MPI with checkpoint/restart via BLCR. I'm not
> sure which path I should set as the value of the --with-blcr option.
> I'm using the 1.3.3 release; which version of BLCR should I use?
>
> I've compiled the newest version of BLCR with --prefix=$BLCR, and I passed
> --with-blcr=$BLCR to the Open MPI configure, but I received:
>
>
> configure:76646: checking if MCA component crs:blcr can compile
> configure:76648: result: no
>
> marcin


info.tar.gz
Description: GNU Zip compressed data