[OMPI users] OpenMPI failed when running across two mac machines
Hi,

We are distributing Open MPI as part of a software suite, so the prefix used at build time is not expected to match the install location on a customer's machine. We did manage to get it running on Linux by setting OPAL_PREFIX, PATH, and LD_LIBRARY_PATH. We tried to do the same thing on Mac, using DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH. Unfortunately, we ran into the error below:

dyld: lazy symbol binding failed: Symbol not found: _orte_daemon

After looking at orte/mca/plm/rsh/plm_rsh_module.c, we realized that the problem is that DYLD_LIBRARY_PATH is not set before launching orted; plm_rsh_module.c still only sets LD_LIBRARY_PATH. We have a patch for this and it seems to address the issue. We would be thrilled if the attached patch were accepted.

Teng

plm_rsh_module.c.patch
Description: Binary data
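A minimal sketch of the macOS launch described above (paths and host names are placeholders, mirroring the Linux-style command shown elsewhere in this archive):

env OPAL_PREFIX=/path/to/openmpi \
    /path/to/openmpi/bin/orterun --prefix /path/to/openmpi \
    -x PATH -x DYLD_LIBRARY_PATH -x OPAL_PREFIX \
    -np 2 --host mac1,mac2 ring_c

Without the patch, the remote orted is still launched with only LD_LIBRARY_PATH set, which is what produces the "Symbol not found: _orte_daemon" dyld error.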
Re: [OMPI users] deadlock when calling MPI_gatherv
Hi Terry,

> How does the stack for the non-SM BTL run look? I assume it is probably the
> same. Also, can you dump the message queues for rank 1? What's interesting
> is that you have a bunch of pending receives; do you expect that to be the
> case when the MPI_Gatherv occurred?

It turns out we had an unbalanced MPI_Bcast buried very deep in the application. After fixing that bug, the application behaves correctly. Thank you all for the help, and sorry for the false alarm.

Teng
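For anyone hitting a similar hang: the failure mode was an unbalanced collective. A minimal, hypothetical illustration of the pattern (not the actual application code) is below; one rank skips the broadcast, so the sequences of collective calls no longer match across ranks.

/* unbalanced_bcast.c -- hypothetical illustration, not the application code.
 * Compile: mpicc unbalanced_bcast.c -o unbalanced_bcast
 * Run:     mpirun -np 2 ./unbalanced_bcast
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* BUG: the broadcast is reached only on rank 0, so the ranks' sequences
     * of collective calls on MPI_COMM_WORLD no longer match. */
    if (rank == 0) {
        value = 42;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    /* The program is now erroneous; in practice the mismatch often shows up
     * as a hang deep inside a *later* collective, which is why the stack
     * traces pointed at MPI_Gatherv rather than at the real culprit. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}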
Re: [OMPI users] deadlock when calling MPI_gatherv
On Apr 26, 2010, at 9:07 PM, Trent Creekmore wrote:

> You are going to have to debug and trace the program to find out where it is
> stopping.
> You may want to try using KDbg, a graphical front end for the command-line
> debugger gdb, which makes it a LOT easier, or use Eclipse.

As a matter of fact, I did use a debugger, TotalView in this case, but it got me nowhere. The only thing I can tell is that both the master and the slave keep running inside the event loop.
[OMPI users] deadlock when calling MPI_gatherv
Hi,

We recently ran into a deadlock when calling MPI_Gatherv with Open MPI 1.3.4. It seemed to have something to do with sm at first; however, it still hangs even after turning off the sm BTL. Any idea how to track down the problem?

Thanks,
Teng

# Stack trace for master node #
mca_btl_sm_component_progress
opal_progress
opal_condition_wait
ompi_request_default_wait_all
ompi_coll_tuned_sendrecv_actual
ompi_coll_tuned_barrier_intra_two_procs
ompi_coll_tuned_barrier_intra_dec_fixed
mca_coll_sync_gatherv
PMPI_Gatherv

# Stack trace for slave node #
mca_btl_sm_component_progress
opal_progress
opal_condition_wait
ompi_request_wait_completion
mca_pml_ob1_recv
mca_coll_basic_gatherv_intra
mca_coll_sync_gatherv

# Message queue from TotalView #

MPI_COMM_WORLD
  Comm_size 2   Comm_rank 0
  Pending receives: none
  Unexpected messages: no information available
  Pending sends: none

MPI_COMM_SELF
  Comm_size 1   Comm_rank 0
  Pending receives: none
  Unexpected messages: no information available
  Pending sends: none

MPI_COMM_NULL
  Comm_size 0   Comm_rank -2
  Pending receives: none
  Unexpected messages: no information available
  Pending sends: none

MPI COMMUNICATOR 3 DUP FROM 0
  Comm_size 2   Comm_rank 0
  Pending receives: none
  Unexpected messages: no information available
  Pending sends: none

MPI COMMUNICATOR 4 SPLIT FROM 3
  Comm_size 2   Comm_rank 0
  Pending receives: none
  Unexpected messages: no information available
  Pending sends: none

MPI COMMUNICATOR 5 SPLIT FROM 4
  Comm_size 2   Comm_rank 0
  Pending receives: none
  Unexpected messages: no information available
  Pending sends: none

MPI COMMUNICATOR 6 SPLIT FROM 4
  Comm_size 1   Comm_rank 0
  Pending receives: none
  Unexpected messages: no information available
  Pending sends: none

MPI COMMUNICATOR 7 DUP FROM 4
  Comm_size 2   Comm_rank 0
  Pending receives
    [0] Receive: 0x80b9000
        Data: 1 * MPI_CHAR
        Status: Pending
        Source: 0 (orterun.0)
        Tag: 7 (0x0007)
        User Buffer: 0xb06fa010 -> 0x (0)
        Buffer Length: 1359312 (0x0014bdd0)
    [1] Receive: 0x80b9200
        Data: 1 * MPI_CHAR
        Status: Pending
        Source: 0 (orterun.0)
        Tag: 5 (0x0005)
        User Buffer: 0xb0c2a010 -> 0x (0)
        Buffer Length: 1359312 (0x0014bdd0)
    [2] Receive: 0x80b9400
        Data: 1 * MPI_CHAR
        Status: Pending
        Source: 1 (orterun.1)
        Tag: 3 (0x0003)
        User Buffer: 0xb115a010 -> 0xc0ef9e79 (-1058038151)
        Buffer Length: 1359312 (0x0014bdd0)
    [3] Receive: 0x80b9600
        Data: 1 * MPI_CHAR
        Status: Pending
        Source: 1 (orterun.1)
        Tag: 1 (0x0001)
        User Buffer: 0xb168a010 -> 0xc0c662aa (-1060740438)
        Buffer Length: 1359312 (0x0014bdd0)
    [4] Receive: 0x82a2500
        Data: 1 * MPI_CHAR
        Status: Pending
        Source: 0 (orterun.0)
        Tag: 11 (0x000b)
        User Buffer: 0xafc9a010 -> 0x (0)
        Buffer Length: 1359312 (0x0014bdd0)
    [5] Receive: 0x82a2700
        Data: 1 * MPI_CHAR
        Status: Pending
        Source: 0 (orterun.0)
        Tag: 9 (0x0009)
        User Buffer: 0xb01ca010 -> 0x (0)
        Buffer Length: 1359312 (0x0014bdd0)
  Unexpected messages: no information available
  Pending sends
    [0] Send: 0x80b8500
        Data transfer completed
        Status: Complete
        Target: 0 (orterun.0)
        Tag: 4 (0x0004)
        Buffer: 0xb0846010 -> 0x40544279 (1079263865)
        Buffer Length: 2548 (0x09f4)
    [1] Send: 0x80b8780
        Data transfer completed
        Status: Complete
        Target: 0 (orterun.0)
        Tag: 6 (0x0006)
        Buffer: 0xb0d76010 -> 0x41a756bf (1101485759)
        Buffer Length: 2992 (0x0bb0)
    [2] Send: 0x80b8a00
        Data transfer completed
        Status: Complete
        Target: 1 (orterun.1)
        Tag: 0 (0x)
        Buffer: 0xb12a6010 -> 0xbf94cfc4 (-1080766524)
        Buffer Length: 3856 (0x0f10)
    [3] Send: 0x80b8c80
        Data transfer completed
        Status: Complete
        Target: 1 (orterun.1)
        Tag: 2 (0x0002)
        Buffer: 0xb17d6010 -> 0x400a1a6c (1074403948)
        Buffer Length: 3952 (0x0f70)
    [4] Send: 0x831f080
        Data transfer completed
        Status: Complete
        Target: 0 (orterun.0)
        Tag
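For readers hitting the same thing: the sm BTL can be turned off at run time via MCA component selection; a typical invocation (program name and hosts are placeholders) would be something like:

orterun --mca btl ^sm -np 2 --host host1,host2 ./my_app

The ^ prefix excludes the named component, leaving the remaining BTLs (e.g. tcp, self) in use.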
Re: [OMPI users] Bug report in plm_lsf_module.c
Ralph,

Thanks for the prompt response.

On Apr 26, 2010, at 2:34 PM, Ralph Castain wrote:

> Appreciate your input! None of the developers have access to an LSF machine
> any more, so we can't test it :-/
>
> What version of OMPI does this patch apply to?

The patch applies to 1.3.4, which is the version we are going to distribute to our customers next month.

> I can go ahead and add it - just want to know if it should just go to the
> trunk and 1.5 series, or also the 1.4 series.

I just checked the trunk and 1.5. It seems like they also need to be patched.

Thanks,
Teng
[OMPI users] Bug report in plm_lsf_module.c
Hi,

We recently identified a bug on our LSF cluster: the job always hangs if all LSF-related components are present. One observation we have is that the job works fine after removing all LSF-related components. Below is the message from stdout:

[:24930] mca: base: components_open: Looking for ess components
[:24930] mca: base: components_open: opening ess components
[:24930] mca: base: components_open: found loaded component env
[:24930] mca: base: components_open: component env has no register function
[:24930] mca: base: components_open: component env open function successful
[:24930] mca: base: components_open: found loaded component hnp
[:24930] mca: base: components_open: component hnp has no register function
[:24930] mca: base: components_open: component hnp open function successful
[:24930] mca: base: components_open: found loaded component lsf
[:24930] mca: base: components_open: component lsf has no register function
[:24930] mca: base: components_open: component lsf open function successful
[:24930] mca: base: components_open: found loaded component singleton
[:24930] mca: base: components_open: component singleton has no register function
[:24930] mca: base: components_open: component singleton open function successful
[:24930] mca: base: components_open: found loaded component slurm
[:24930] mca: base: components_open: component slurm has no register function
[:24930] mca: base: components_open: component slurm open function successful
[:24930] mca: base: components_open: found loaded component tool
[:24930] mca: base: components_open: component tool has no register function
[:24930] mca: base: components_open: component tool open function successful
[:24930] mca: base: components_open: Looking for plm components
[:24930] mca: base: components_open: opening plm components
[:24930] mca: base: components_open: found loaded component lsf
[:24930] mca: base: components_open: component lsf has no register function
[:24930] mca: base: components_open: component lsf open function successful
[:24930] mca: base: components_open: found loaded component rsh
[:24930] mca: base: components_open: component rsh has no register function
[:24930] mca: base: components_open: component rsh open function successful
[:24930] mca: base: components_open: found loaded component slurm
[:24930] mca: base: components_open: component slurm has no register function
[:24930] mca: base: components_open: component slurm open function successful
[:24930] mca:base:select: Auto-selecting plm components
[:24930] mca:base:select:( plm) Querying component [lsf]
[:24930] mca:base:select:( plm) Query of component [lsf] set priority to 75
[:24930] mca:base:select:( plm) Querying component [rsh]
[:24930] mca:base:select:( plm) Query of component [rsh] set priority to 10
[:24930] mca:base:select:( plm) Querying component [slurm]
[:24930] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[:24930] mca:base:select:( plm) Selected component [lsf]
[:24930] mca: base: close: component rsh closed
[:24930] mca: base: close: unloading component rsh
[:24930] mca: base: close: component slurm closed
[:24930] mca: base: close: unloading component slurm
[:24930] mca: base: components_open: Looking for rml components
[:24930] mca: base: components_open: opening rml components
[:24930] mca: base: components_open: found loaded component oob
[:24930] mca: base: components_open: component oob has no register function
[:24930] mca: base: components_open: Looking for oob components
[:24930] mca: base: components_open: opening oob components
[:24930] mca: base: components_open: found loaded component tcp
[:24930] mca: base: components_open: component tcp has no register function
[:24930] mca: base: components_open: component tcp open function successful
[:24930] mca: base: components_open: component oob open function successful
[:24930] orte_rml_base_select: initializing rml component oob
[:24930] mca: base: components_open: Looking for ras components
[:24930] mca: base: components_open: opening ras components
[:24930] mca: base: components_open: found loaded component lsf
[:24930] mca: base: components_open: component lsf has no register function
[:24930] mca: base: components_open: component lsf open function successful
[:24930] mca: base: components_open: found loaded component slurm
[:24930] mca: base: components_open: component slurm has no register function
[:24930] mca: base: components_open: component slurm open function successful
[:24930] mca:base:select: Auto-selecting ras components
[:24930] mca:base:select:( ras) Querying component [lsf]
[:24930] mca:base:select:( ras) Query of component [lsf] set priority to 75
[:24930] mca:base:select:(
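One way to disable the LSF-related components at run time without rebuilding (a workaround sketch; the exact framework list is an assumption, and it is not necessarily how the components were removed in the report above) is MCA component exclusion:

orterun --mca ess ^lsf --mca plm ^lsf --mca ras ^lsf -np 2 ./my_app

The ^lsf syntax tells each listed framework to skip its lsf component, so the rsh/ssh launch path is used instead.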
[OMPI users] OPAL_PREFIX is not passed to remote node in pls_rsh_module.c
Hi All,

We have bundled Open MPI with our product and shipped it to the customer. Following http://www.open-mpi.org/faq/?category=building#installdirs , below is the command we used to launch an MPI program:

env OPAL_PREFIX=/path/to/openmpi \
  /path/to/openmpi/bin/orterun --prefix /path/to/openmpi -x PATH -x LD_LIBRARY_PATH -x OPAL_PREFIX -np 2 --host host1,host2 ring_c

The interesting fact is that it always works under csh/tcsh, but quite a few users told us that they run into the errors below:

[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
from the file:
    help-orte-runtime
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orted:init-failure
from the file:
    help-orted.txt
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------

Jeff did mention in http://www.open-mpi.org/community/lists/users/2008/09/6582.php that OPAL_PREFIX was propagated for him automatically. I bet Jeff uses csh/tcsh. Anyway, it can be traced back to how the daemon is launched.

sh/bash:
[x:25369] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x OPAL_PREFIX=/opt/openmpi-1.2.4 ; PATH=/opt/openmpi-1.2.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.2.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;

csh/tcsh:
[x:09886] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x setenv OPAL_PREFIX /opt/openmpi-1.2.4 ;

Note that under sh/bash, OPAL_PREFIX is assigned but never exported. It seems to work after I patched pls_rsh_module.c:

--- pls_rsh_module.c.orig	2008-10-16 17:15:32.0 -0400
+++ pls_rsh_module.c	2008-10-16 17:15:51.0 -0400
@@ -989,7 +989,7 @@
                  "%s/%s/%s",
                  (opal_prefix != NULL ? "OPAL_PREFIX=" : ""),
                  (opal_prefix != NULL ? opal_prefix : ""),
-                 (opal_prefix != NULL ? " ;" : ""),
+                 (opal_prefix != NULL ? " ; export OPAL_PREFIX ; " : ""),
                  prefix_dir, bin_base,
                  prefix_dir, lib_base,
                  prefix_dir, bin_base,

Another workaround is to add "export OPAL_PREFIX" to $HOME/.bashrc.

Jeff, is this a bug in the code, or is there a reason that OPAL_PREFIX is not exported for sh/bash?

Teng
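For completeness, the .bashrc workaround mentioned above amounts to marking the variable for export on the remote side, so that the assignment placed on the ssh command line is inherited by orted (a sketch; adjust for your own installation):

# $HOME/.bashrc on each remote node
export OPAL_PREFIX    # the value itself is supplied by the ssh command line shown above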
[OMPI users] 32-bit openib btl fails on 64-bit OS
Dear All,

In order to run a 32-bit program on a 64-bit cluster, one has to build a 32-bit Open MPI. Following some instructions on this mailing list, I successfully built Open MPI 1.2.4 on a 64-bit OS. However, I run into an openib problem when I try to run the hello_c program. I also built a 64-bit Open MPI from the same source, and the interesting fact is that the 64-bit Open MPI works just fine. Below is the output from orterun:

iceland:/home/tlin/test_pbs> /home/tin/openmpi-1.2.4/bin/orterun -np 2 --hostfile mach.lst /home/tlin/test_pbs/hello_c.32
--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to create an internal
queue.  This typically indicates a failed OpenFabrics installation or
faulty hardware.  The failure occured here:

    Host:        cl1n004
    OMPI source: btl_openib.c:828
    Function:    ibv_create_cq()
    Error:       Invalid argument (errno=22)
    Device:      mthca0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to create an internal
queue.  This typically indicates a failed OpenFabrics installation or
faulty hardware.  The failure occured here:

    Host:        cl1n001
    OMPI source: btl_openib.c:828
    Function:    ibv_create_cq()
    Error:       Invalid argument (errno=22)
    Device:      mthca0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

I saw this error before on another cluster. Following the instructions at http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages did fix the problem there. However, I doubt that is the reason why the 32-bit Open MPI does not work on this cluster: the output from limit looks fine to me, and if that were the cause, the 64-bit Open MPI would not work either.

Any ideas?

Thanks,
Teng
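For reference, the FAQ entry mentioned above concerns the locked-memory (memlock) limit; a quick way to check it on each compute node, in the shell that starts the MPI processes, is:

ulimit -l    # bash/sh: should report "unlimited" or a large value;
             # csh/tcsh users can run "limit" and look at the memorylocked entry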
[OMPI users] Job does not quit even when the simulation dies
Hi,

I just realized I have a job that has been running for a long time while some of its nodes have already died. Is there any way to ask the other nodes to quit?

[kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with errno=104
[kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with errno=104

The FAQ does mention that this errno is "Connection reset by peer": these types of errors usually occur after MPI_INIT has completed, and typically indicate that an MPI process has died unexpectedly (e.g., due to a seg fault). The specific error message indicates that a peer MPI process tried to write to the now-dead MPI process and failed.

Thanks,
Teng
[OMPI users] Bundling OpenMPI
Hi,

We would like to distribute Open MPI along with our software to customers; is there any legal issue we need to know about?

We can successfully build Open MPI using:

./configure --prefix=/some_path; make; make install

However, if we do cp -r /some_path /other_path and try to run /other_path/bin/orterun, the error message below is thrown:

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orterun:usage
from the file:
    help-orterun.txt
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------

Apparently, the path is hard-coded in the executable. Is there any way to fix it (such as using an environment variable, etc.)?

Thanks,
Teng
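The relocation part of this question is what the OPAL_PREFIX environment variable addresses (see the Open MPI FAQ entry http://www.open-mpi.org/faq/?category=building#installdirs , also referenced elsewhere in this archive); a minimal sketch, assuming the tree was copied to /other_path:

export OPAL_PREFIX=/other_path
export PATH=/other_path/bin:$PATH
export LD_LIBRARY_PATH=/other_path/lib:$LD_LIBRARY_PATH
/other_path/bin/orterun -np 2 ./hello_c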