[OMPI users] OpenMPI failed when running across two mac machines

2012-01-20 Thread Teng Lin
Hi,

We are distributing OpenMPI as part of a software suite, so the prefix used at build 
time is not expected to match the install location on a customer's machine. We did 
manage to get it running on Linux by setting OPAL_PREFIX, PATH and LD_LIBRARY_PATH. 
We tried to do the same thing on Mac, using DYLD_LIBRARY_PATH instead of 
LD_LIBRARY_PATH. Unfortunately, we run into the error below:
dyld: lazy symbol binding failed: Symbol not found: _orte_daemon
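
For reference, the environment we export before launching looks roughly like this 
(the paths and host names are placeholders, not our real install location):

# Linux
export OPAL_PREFIX=/path/to/openmpi
export PATH=$OPAL_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH

# Mac, where the dynamic linker reads DYLD_LIBRARY_PATH instead
export OPAL_PREFIX=/path/to/openmpi
export PATH=$OPAL_PREFIX/bin:$PATH
export DYLD_LIBRARY_PATH=$OPAL_PREFIX/lib:$DYLD_LIBRARY_PATH

# launch, forwarding the library path to the remote ranks
$OPAL_PREFIX/bin/mpirun -x DYLD_LIBRARY_PATH -np 2 --host mac1,mac2 ./ring_c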


After looking at orte/mca/plm/rsh/plm_rsh_module.c, we realized that the problem 
is that DYLD_LIBRARY_PATH is not set before launching orted; plm_rsh_module.c 
still only sets LD_LIBRARY_PATH. We have a patch for this and it seems to address 
the issue. We would be thrilled if the attached patch were accepted.
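
To see exactly what the rsh launcher prepends to the remote orted command line (and 
hence which library-path variable it sets), it helps to bump the launcher verbosity; 
a sketch, assuming the 1.3-era parameter name:

# print the ssh command that mpirun composes for the remote orted
mpirun --mca plm_base_verbose 10 --prefix /path/to/openmpi \
    -np 2 --host mac1,mac2 ./ring_c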


Teng


plm_rsh_module.c.patch
Description: Binary data


Re: [OMPI users] deadlock when calling MPI_gatherv

2010-04-27 Thread Teng Lin
Hi Terry,


> How does the stack for the non-SM BTL run look, I assume it probably is the 
> same?  Also, can you dump the message queues for rank 1?  What's interesting 
> is you have a bunch of pending receives, do you expect that to be the case 
> when the MPI_Gatherv occurred?

It turns out we have an unbalanced MPI_Bcast buried very deep in the 
application. After fixing that bug, the application behaves correctly.
Thank you all for the help, and sorry for the false alarm.

Teng

Re: [OMPI users] deadlock when calling MPI_gatherv

2010-04-26 Thread Teng Lin

On Apr 26, 2010, at 9:07 PM, Trent Creekmore wrote:

> You are going to have to debug and trace the program to find out where it is
> stopping.
> You may want to try using KDbg, a graphical front end for the command line
> debugger dbg, which makes it a LOT easier, or use Eclipse.

As a matter of fact, I did use a debugger (TotalView in this case), but it got me 
nowhere. The only thing I can tell is that both the master and the slave keep 
running inside the event loop.




[OMPI users] deadlock when calling MPI_gatherv

2010-04-26 Thread Teng Lin
Hi,

We recently ran into a deadlock when calling MPI_Gatherv with Open MPI 1.3.4. At 
first it seemed to have something to do with the sm BTL, but it still hangs even 
after turning off the sm BTL.
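
(For the record, we turned the sm BTL off along these lines; a sketch, with the 
application name as a placeholder:)

# exclude the shared-memory BTL ...
mpirun --mca btl ^sm -np 2 ./app
# ... or, equivalently, name the transports to use explicitly
mpirun --mca btl tcp,self -np 2 ./app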

Any idea how to track down the problem?

Thanks,
Teng
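
The message queues below came from TotalView; attaching gdb to each hung rank gives 
similar stack traces if TotalView is not at hand (the PID is hypothetical):

# attach to one hung rank and dump its call stack
gdb --batch -p 12345 -ex 'bt'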

#
Stack trace for master node
#
mca_btl_sm_component_progress
opal_progress
opal_condition_wait
ompi_request_default_wait_all
ompi_coll_tuned_sendrecv_actual
ompi_coll_tuned_barrier_intra_two_procs
ompi_coll_tuned_barrier_intra_dec_fixed
mca_coll_sync_gatherv
PMPI_Gatherv


#
Stack trace for slave node
#
mca_btl_sm_component_progress
opal_progress
opal_condition_wait
ompi_request_wait_completion
mca_pml_ob1_recv
mca_coll_basic_gatherv_intra
mca_coll_sync_gatherv


#
Message queue from TotalView

MPI_COMM_WORLD
Comm_size   2
Comm_rank   0
Pending receives: none
Unexpected messages : no information available
Pending sends   : none

MPI_COMM_SELF
Comm_size   1
Comm_rank   0
Pending receives: none
Unexpected messages : no information available
Pending sends   : none

MPI_COMM_NULL
Comm_size   0
Comm_rank   -2
Pending receives: none
Unexpected messages : no information available
Pending sends   : none

MPI COMMUNICATOR 3 DUP FROM 0
Comm_size   2
Comm_rank   0
Pending receives: none
Unexpected messages : no information available
Pending sends   : none

MPI COMMUNICATOR 4 SPLIT FROM 3
Comm_size   2
Comm_rank   0
Pending receives: none
Unexpected messages : no information available
Pending sends   : none

MPI COMMUNICATOR 5 SPLIT FROM 4
Comm_size   2
Comm_rank   0
Pending receives: none
Unexpected messages : no information available
Pending sends   : none

MPI COMMUNICATOR 6 SPLIT FROM 4
Comm_size   1
Comm_rank   0
Pending receives: none
Unexpected messages : no information available
Pending sends   : none

MPI COMMUNICATOR 7 DUP FROM 4
Comm_size   2
Comm_rank   0
Pending receives   
[0]
   Receive: 0x80b9000
   Data: 1 * MPI_CHAR
   Status   Pending
   Source   0 (orterun.0)
   Tag  7 (0x0007)
   User Buffer  0xb06fa010 -> 0x (0)
   Buffer Length    1359312 (0x0014bdd0)
[1]
   Receive: 0x80b9200
   Data: 1 * MPI_CHAR
   Status   Pending
   Source   0 (orterun.0)
   Tag  5 (0x0005)
   User Buffer  0xb0c2a010 -> 0x (0)
   Buffer Length    1359312 (0x0014bdd0)
[2]
   Receive: 0x80b9400
   Data: 1 * MPI_CHAR
   Status   Pending
   Source   1 (orterun.1)
   Tag  3 (0x0003)
   User Buffer  0xb115a010 -> 0xc0ef9e79 (-1058038151)
   Buffer Length    1359312 (0x0014bdd0)
[3]
   Receive: 0x80b9600
   Data: 1 * MPI_CHAR
   Status   Pending
   Source   1 (orterun.1)
   Tag  1 (0x0001)
   User Buffer  0xb168a010 -> 0xc0c662aa (-1060740438)
   Buffer Length    1359312 (0x0014bdd0)
[4]
   Receive: 0x82a2500
   Data: 1 * MPI_CHAR
   Status   Pending
   Source   0 (orterun.0)
   Tag  11 (0x000b)
   User Buffer  0xafc9a010 -> 0x (0)
   Buffer Length    1359312 (0x0014bdd0)
[5]
   Receive: 0x82a2700
   Data: 1 * MPI_CHAR
   Status   Pending
   Source   0 (orterun.0)
   Tag  9 (0x0009)
   User Buffer  0xb01ca010 -> 0x (0)
   Buffer Length    1359312 (0x0014bdd0)

Unexpected messages : no information available
Pending sends
[0]
   Send: 0x80b8500
   Data transfer completed
   Status   Complete
   Target   0 (orterun.0)
   Tag  4 (0x0004)
   Buffer   0xb0846010 -> 0x40544279 (1079263865)
   Buffer Length    2548 (0x09f4)
[1]
   Send: 0x80b8780
   Data transfer completed
   Status   Complete
   Target   0 (orterun.0)
   Tag  6 (0x0006)
   Buffer   0xb0d76010 -> 0x41a756bf (1101485759)
   Buffer Length    2992 (0x0bb0)
[2]
   Send: 0x80b8a00
   Data transfer completed
   Status   Complete
   Target   1 (orterun.1)
   Tag  0 (0x)
   Buffer   0xb12a6010 -> 0xbf94cfc4 (-1080766524)
   Buffer Length    3856 (0x0f10)
[3]
   Send: 0x80b8c80
   Data transfer completed
   Status   Complete
   Target   1 (orterun.1)
   Tag  2 (0x0002)
   Buffer   0xb17d6010 -> 0x400a1a6c (1074403948)
   Buffer Length    3952 (0x0f70)
[4]
   Send: 0x831f080
   Data transfer completed
   Status   Complete
   Target   0 (orterun.0)
   Tag   

Re: [OMPI users] Bug report in plm_lsf_module.c

2010-04-26 Thread Teng Lin
Ralph,

Thanks for the prompt response.
On Apr 26, 2010, at 2:34 PM, Ralph Castain wrote:

> Appreciate your input! None of the developers have access to an LSF machine 
> any more, so we can't test it :-/
> 
> What version of OMPI does this patch apply to?
The patch applies to 1.3.4, which is the version we are going to distribute to our 
customers next month.

> I can go ahead and add it - just want to know if it should just go to the 
> trunk and 1.5 series, or also the 1.4 series.
I just checked the trunk and 1.5. It seems like they also need to be patched.

Thanks,
Teng


[OMPI users] Bug report in plm_lsf_module.c

2010-04-26 Thread Teng Lin
Hi,

We recently identified a bug on our LSF cluster. The job always hangs if all of the 
LSF-related components are present; one observation we have is that the job works 
fine after removing all LSF-related components (see the sketch below).
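
Removing them at run time looks roughly like this (a sketch; the framework names 
are the ones that show up in the log below):

# tell mpirun not to consider the LSF components at all
mpirun --mca ess ^lsf --mca plm ^lsf --mca ras ^lsf -np 2 ./app

# alternatively, the mca_*_lsf DSOs can be moved out of $prefix/lib/openmpi
# so that they are never opened in the first place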

Below is the message from stdout:
[:24930] mca: base: components_open: Looking for ess components
[:24930] mca: base: components_open: opening ess components
[:24930] mca: base: components_open: found loaded component env
[:24930] mca: base: components_open: component env has no register function
[:24930] mca: base: components_open: component env open function successful
[:24930] mca: base: components_open: found loaded component hnp
[:24930] mca: base: components_open: component hnp has no register function
[:24930] mca: base: components_open: component hnp open function successful
[:24930] mca: base: components_open: found loaded component lsf
[:24930] mca: base: components_open: component lsf has no register function
[:24930] mca: base: components_open: component lsf open function successful
[:24930] mca: base: components_open: found loaded component singleton
[:24930] mca: base: components_open: component singleton has no register function
[:24930] mca: base: components_open: component singleton open function successful
[:24930] mca: base: components_open: found loaded component slurm
[:24930] mca: base: components_open: component slurm has no register function
[:24930] mca: base: components_open: component slurm open function successful
[:24930] mca: base: components_open: found loaded component tool
[:24930] mca: base: components_open: component tool has no register function
[:24930] mca: base: components_open: component tool open function successful
[:24930] mca: base: components_open: Looking for plm components
[:24930] mca: base: components_open: opening plm components
[:24930] mca: base: components_open: found loaded component lsf
[:24930] mca: base: components_open: component lsf has no register function
[:24930] mca: base: components_open: component lsf open function successful
[:24930] mca: base: components_open: found loaded component rsh
[:24930] mca: base: components_open: component rsh has no register function
[:24930] mca: base: components_open: component rsh open function successful
[:24930] mca: base: components_open: found loaded component slurm
[:24930] mca: base: components_open: component slurm has no register function
[:24930] mca: base: components_open: component slurm open function successful
[:24930] mca:base:select: Auto-selecting plm components
[:24930] mca:base:select:(  plm) Querying component [lsf]
[:24930] mca:base:select:(  plm) Query of component [lsf] set priority to 75
[:24930] mca:base:select:(  plm) Querying component [rsh]
[:24930] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[:24930] mca:base:select:(  plm) Querying component [slurm]
[:24930] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[:24930] mca:base:select:(  plm) Selected component [lsf]
[:24930] mca: base: close: component rsh closed
[:24930] mca: base: close: unloading component rsh
[:24930] mca: base: close: component slurm closed
[:24930] mca: base: close: unloading component slurm
[:24930] mca: base: components_open: Looking for rml components
[:24930] mca: base: components_open: opening rml components
[:24930] mca: base: components_open: found loaded component oob
[:24930] mca: base: components_open: component oob has no register function
[:24930] mca: base: components_open: Looking for oob components
[:24930] mca: base: components_open: opening oob components
[:24930] mca: base: components_open: found loaded component tcp
[:24930] mca: base: components_open: component tcp has no register function
[:24930] mca: base: components_open: component tcp open function successful
[:24930] mca: base: components_open: component oob open function successful
[:24930] orte_rml_base_select: initializing rml component oob
[:24930] mca: base: components_open: Looking for ras components
[:24930] mca: base: components_open: opening ras components
[:24930] mca: base: components_open: found loaded component lsf
[:24930] mca: base: components_open: component lsf has no register function
[:24930] mca: base: components_open: component lsf open function successful
[:24930] mca: base: components_open: found loaded component slurm
[:24930] mca: base: components_open: component slurm has no register function
[:24930] mca: base: components_open: component slurm open function successful
[:24930] mca:base:select: Auto-selecting ras components
[:24930] mca:base:select:(  ras) Querying component [lsf]
[:24930] mca:base:select:(  ras) Query of component [lsf] set priority to 75
[:24930] mca:base:select:(  

[OMPI users] OPAL_PREFIX is not passed to remote node in pls_rsh_module.c

2008-10-17 Thread Teng Lin

Hi All,

We have bundled Open MPI with our product and shipped it to customers. Following 
http://www.open-mpi.org/faq/?category=building#installdirs, we set OPAL_PREFIX so 
that the installation can be relocated at run time.


Below is the command we used to launch the MPI program:
env OPAL_PREFIX=/path/to/openmpi \
/path/to/openmpi/bin//orterun --prefix /path/to/openmpi -x PATH -x  
LD_LIBRARY_PATH -x OPAL_PREFIX -np 2 --host host1,host2 ring_c


The interesting fact is that it always works under csh/tcsh, but quite a few 
users told us that they run into the errors below:


[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init_stage1.c at line 182

--
Sorry!  You were supposed to get help about:
  orte_init:startup:internal-failure
from the file:
  help-orte-runtime
But I couldn't find any file matching that name.  Sorry!

--
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 52

--
Sorry!  You were supposed to get help about:
  orted:init-failure
from the file:
  help-orted.txt
But I couldn't find any file matching that name.  Sorry!


Jeff did mention in http://www.open-mpi.org/community/lists/users/2008/09/6582.php 
 that OPAL_PREFIX was propagated for him automatically. I bet Jeff  
uses csh/tcsh.

Anyway, it can be traced back to how the daemon is launched.

sh/bash:

[x:25369] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x
OPAL_PREFIX=/opt/openmpi-1.2.4 ;
PATH=/opt/openmpi-1.2.4/bin:$PATH
; export PATH ;
LD_LIBRARY_PATH=/opt/openmpi-1.2.4/lib:$LD_LIBRARY_PATH ; export  
LD_LIBRARY_PATH ;


csh/tcsh:
[x:09886] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x
setenv OPAL_PREFIX /opt/openmpi-1.2.4 ;
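
The difference matters because in sh/bash a bare assignment followed by ";" is not 
exported to the commands that come after it, while csh's setenv always puts the 
variable into the environment. A quick way to see this (the value is arbitrary):

# without "export" the child process never sees the variable
sh -c 'OPAL_PREFIX=/opt/openmpi-1.2.4 ; env | grep OPAL_PREFIX'
# with "export" it does
sh -c 'OPAL_PREFIX=/opt/openmpi-1.2.4 ; export OPAL_PREFIX ; env | grep OPAL_PREFIX'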


It seems to work after I patched pls_rsh_module.c


--- pls_rsh_module.c.orig   2008-10-16 17:15:32.0 -0400
+++ pls_rsh_module.c        2008-10-16 17:15:51.0 -0400
@@ -989,7 +989,7 @@
                        "%s/%s/%s",
                        (opal_prefix != NULL ? "OPAL_PREFIX=" : ""),
                        (opal_prefix != NULL ? opal_prefix : ""),
-                       (opal_prefix != NULL ? " ;" : ""),
+                       (opal_prefix != NULL ? " ; export OPAL_PREFIX ; " : ""),
                        prefix_dir, bin_base,
                        prefix_dir, lib_base,
                        prefix_dir, bin_base,

Another workaround is to add
export OPAL_PREFIX
into $HOME/.bashrc.

Jeff, is this a bug in the code? Or is there a reason that OPAL_PREFIX is not 
exported for sh/bash?


Teng


[OMPI users] 32-bit openib btl fails on 64-bit OS

2008-04-06 Thread Teng Lin

Dear All,

In order to run a 32-bit program on a 64-bit cluster, one has to build a 32-bit 
OpenMPI. Following some instructions on this mailing list, I successfully built 
32-bit OpenMPI 1.2.4 on the 64-bit OS. However, I run into an openib problem when 
I try to run the hello_c program. I also built a 64-bit OpenMPI from the same 
source, and the interesting fact is that the 64-bit OpenMPI works just fine. 
Below is the output from orterun:



iceland:/home/tlin/test_pbs>/home/tin/openmpi-1.2.4/bin/orterun -np 2  
--hostfile mach.lst /home/tlin/test_pbs/hello_c.32

--
The OpenIB BTL failed to initialize while trying to create an internal
queue.  This typically indicates a failed OpenFabrics installation or
faulty hardware.  The failure occured here:

   Host:cl1n004
   OMPI source: btl_openib.c:828
   Function:ibv_create_cq()
   Error:   Invalid argument (errno=22)
   Device:  mthca0

You may need to consult with your system administrator to get this
problem fixed.
--
--
The OpenIB BTL failed to initialize while trying to create an internal
queue.  This typically indicates a failed OpenFabrics installation or
faulty hardware.  The failure occured here:

   Host:cl1n001
   OMPI source: btl_openib.c:828
   Function:ibv_create_cq()
   Error:   Invalid argument (errno=22)
   Device:  mthca0

You may need to consult with your system administrator to get this
problem fixed.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or  
environment

problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Error" (-1) instead of "Success" (0)
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or  
environment

problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Error" (-1) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
##
I have seen this error before on another cluster. Following the instructions at 
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages does fix the 
problem there. However, I doubt that is the reason why the 32-bit OpenMPI does not 
work on this cluster: the output from limit looks fine to me, and if the 
locked-memory limit were the problem, the 64-bit OpenMPI would not work either. 
Any ideas?
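
For completeness, this is how I checked the locked-memory limit that the FAQ entry 
is about; the limits.conf lines are the fix the FAQ suggests, shown here only as a 
sketch:

# check the registered ("locked") memory limit
ulimit -l              # sh/bash
limit memorylocked     # csh/tcsh

# the FAQ's fix is to raise it, e.g. in /etc/security/limits.conf:
#   * soft memlock unlimited
#   * hard memlock unlimited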



Thanks,
Teng


[OMPI users] Job does not quit even when the simulation dies

2007-11-06 Thread Teng Lin

Hi,


Just realized I have a job that has been running for a long time while some of the 
nodes have already died. Is there any way to ask the other nodes to quit?



[kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with  
errno=104
[kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with  
errno=104


The FAQ does mention it is related to:
 Connection reset by peer: These types of errors usually occur after  
MPI_INIT has completed, and typically indicate that an MPI process has  
died unexpectedly (e.g., due to a seg fault). The specific error  
message indicates that a peer MPI process tried to write to the now- 
dead MPI process and failed.


Thanks,
Teng


[OMPI users] Bundling OpenMPI

2007-09-27 Thread Teng Lin

Hi,


We would like to distribute OpenMPI along with our software to customers. Is 
there any legal issue we need to know about?


We can successfully build OpenMPI using
./configure --prefix=/some_path;make;make install

However, if we do

cp -r /some_path /other_path

and try to run
/other_path/bin/orterun,
the error message below is thrown:
 
--

Sorry!  You were supposed to get help about:
orterun:usage
from the file:
help-orterun.txt
But I couldn't find any file matching that name.  Sorry!
 
--


Apparently, the path is hard-coded in the executable. Is there any way to fix 
this, such as by using an environment variable (see the sketch below)?
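
For what it's worth, what we are hoping for is something along these lines; 
OPAL_PREFIX (described in the installdirs FAQ referenced elsewhere in this archive) 
looks like the relevant knob, with the paths and program name as placeholders:

# point the relocated Open MPI at its new home instead of the configure-time prefix
export OPAL_PREFIX=/other_path
export PATH=/other_path/bin:$PATH
export LD_LIBRARY_PATH=/other_path/lib:$LD_LIBRARY_PATH
/other_path/bin/orterun -np 2 ./a.out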



Thanks,
Teng