Hi Martín,

Thanks, this is helpful. Are you getting this timeout when you're running
the spawner process as a singleton?

Howard

On Fri., Aug. 14, 2020 at 17:44, Martín Morales <
martineduardomora...@hotmail.com> wrote:

> Howard,
>
>
>
> I pasted below the error message that appears after a while during the hang I referred to.
>
> Regards,
>
>
>
> Martín
>
>
>
> -
>
>
>
> A request has timed out and will therefore fail:
>
>   Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345
>
> Your job may terminate as a result of this problem. You may want to
> adjust the MCA parameter pmix_server_max_wait and try again. If this
> occurred during a connect/accept operation, you can adjust that time
> using the pmix_base_exchange_timeout parameter.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Timeout" (-15) instead of "Success" (0)
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:03767] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:03767] *** reported by process [2337734658,0]
> [nos-GF7050VT-M:03767] *** on a NULL communicator
> [nos-GF7050VT-M:03767] *** Unknown error
> [nos-GF7050VT-M:03767] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [nos-GF7050VT-M:03767] ***    and potentially your MPI job)
> [osboxes:02457] *** An error occurred in MPI_Comm_spawn
> [osboxes:02457] *** reported by process [2337734657,0]
> [osboxes:02457] *** on communicator MPI_COMM_WORLD
> [osboxes:02457] *** MPI_ERR_UNKNOWN: unknown error
> [osboxes:02457] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [osboxes:02457] ***    and potentially your MPI job)
> [osboxes:02458] 1 more process has sent help message help-orted.txt / timedout
> [osboxes:02458] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
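>
> (A possible sketch of raising that timeout, based only on the parameter
> name printed in the help text above; the value and whether it actually
> helps here are assumptions:)
>
>     $ mpirun --mca pmix_base_exchange_timeout 600 -np 1 ./simple_spawn 2
>
> or, for a singleton run without mpirun, via the equivalent environment
> variable:
>
>     $ export OMPI_MCA_pmix_base_exchange_timeout=600
>     $ ./simple_spawn 2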
>
> From: Martín Morales via users <users@lists.open-mpi.org>
> Sent: Friday, August 14, 2020 19:40
> To: Howard Pritchard <hpprit...@gmail.com>
> Cc: Martín Morales <martineduardomora...@hotmail.com>; Open MPI Users
> <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hi Howard.
>
>
>
> Thanks for the tracking issue on GitHub. I have run with mpirun, without “master” in
> the hostfile, and it runs OK. The hang occurs when I run as a singleton
> (no mpirun), which is the way I need to run. If I run top on both
> machines, the processes are correctly mapped but hung. It seems the
> MPI_Init() function doesn’t return. Thanks for your help.
>
> Best regards,
>
>
>
> Martín
>
> From: Howard Pritchard <hpprit...@gmail.com>
> Sent: Friday, August 14, 2020 15:18
> To: Martín Morales <martineduardomora...@hotmail.com>
> Cc: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hi Martin,
>
>
>
> I opened an issue on Open MPI's GitHub to track this:
> https://github.com/open-mpi/ompi/issues/8005
>
>
>
> You may be seeing another problem if you removed master from the host
> file.
>
> Could you add the --debug-daemons option to the mpirun command line and post the output?
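>
> For example, based on the run command from the original post (a suggestion,
> not a verified command line):
>
>     $ mpirun --debug-daemons -np 1 ./simple_spawn 2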
>
>
>
> Howard
>
>
>
>
>
> On Tue., Aug. 11, 2020 at 17:35, Martín Morales <
> martineduardomora...@hotmail.com> wrote:
>
> Hi Howard.
>
>
>
> Great! That fixes the crashing problem with OMPI 4.0.4. However, it
> still hangs if I remove “master” (the host that launches the spawning
> processes) from my hostfile.
>
> I need to spawn only on “worker”. Is there a way or workaround to do this
> without mpirun?
>
> Thanks a lot for your assistance.
>
>
>
> Martín
>
> From: Howard Pritchard <hpprit...@gmail.com>
> Sent: Monday, August 10, 2020 19:13
> To: Martín Morales <martineduardomora...@hotmail.com>
> Cc: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hi Martin,
>
>
>
> I was able to reproduce this with the 4.0.x branch.  I'll open an issue.
>
>
>
> If you really want to use 4.0.4, then what you'll need to do is build an
> external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and
> then build Open MPI using --with-pmix=<where your PMIx is installed>.
>
> You will also need to build both Open MPI and PMIx against the same
> libevent.  Both packages have a configure option to use an
> external libevent installation.
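>
> A rough sketch of what that might look like (all paths, versions, and the
> parallel make level below are assumptions, not verified commands):
>
>     # build PMIx 3.1.2 against an already-installed external libevent
>     $ cd pmix-3.1.2
>     $ ./configure --prefix=/opt/pmix-3.1.2 --with-libevent=/opt/libevent
>     $ make -j4 install
>
>     # build Open MPI 4.0.4 against that same PMIx and libevent
>     $ cd ../openmpi-4.0.4
>     $ ./configure --prefix=/opt/openmpi-4.0.4 \
>           --with-pmix=/opt/pmix-3.1.2 --with-libevent=/opt/libevent
>     $ make -j4 install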
>
>
>
> Howard
>
>
>
>
>
> On Mon., Aug. 10, 2020 at 13:52, Martín Morales <
> martineduardomora...@hotmail.com> wrote:
>
> Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have
> to post this on the bug section? Thanks and regards.
>
>
>
> Martín
>
>
>
> From: Howard Pritchard <hpprit...@gmail.com>
> Sent: Monday, August 10, 2020 14:44
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Martín Morales <martineduardomora...@hotmail.com>
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hello Martin,
>
>
>
> Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx
> version, which introduced a problem with spawn in the 4.0.2-4.0.4 releases.
>
> This is supposed to be fixed in the 4.0.5 release.  Could you try the
> 4.0.5rc1 tarball and see if that addresses the problem you're seeing?
>
>
>
> https://www.open-mpi.org/software/ompi/v4.0/
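>
> For reference, a minimal sketch of trying the release candidate (the exact
> tarball name and the install prefix are assumptions):
>
>     $ tar xjf openmpi-4.0.5rc1.tar.bz2
>     $ cd openmpi-4.0.5rc1
>     $ ./configure --prefix=$HOME/openmpi-4.0.5rc1
>     $ make -j4 all install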
>
>
>
> Howard
>
>
>
>
>
>
>
> On Thu., Aug. 6, 2020 at 09:50, Martín Morales via users <
> users@lists.open-mpi.org> wrote:
>
>
>
> Hello people!
>
> I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one
> "master" and one "worker", on an Ethernet LAN. Both run Ubuntu 18.04. I built
> OMPI like this:
>
>
>
> ./configure --prefix=/usr/local/openmpi-4.0.4/bin/
>
>
>
> My hostfile is this:
>
>
>
> master slots=2
> worker slots=2
>
>
>
> I'm trying to allocate the processes dynamically with MPI_Comm_spawn().
>
> If I launch the processes only on the "master" machine, it's OK. But if I
> use the hostfile, it crashes with this:
>
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error; Open
> MPI requires that all MPI processes be able to reach each other.  This
> error can sometimes be the result of forgetting to specify the "self" BTL.
>
>   Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
>   Process 2 ([[35155,1],0]) is on host: unknown!
>   BTLs attempted: tcp self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file
> dpm/dpm.c at line 493
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can fail
> during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
> [nos-GF7050VT-M:22526] *** on a NULL communicator
> [nos-GF7050VT-M:22526] *** Unknown error
> [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [nos-GF7050VT-M:22526] ***    and potentially your MPI job)
>
>
>
> Note: host "nos-GF7050VT-M" is "worker"
>
>
>
> But if I run without "master" in the hostfile, the processes are launched but
> it hangs: MPI_Init() doesn't return.
>
> I launched the program (pasted below) in these 2 ways, with the same result:
>
>
>
> $ ./simple_spawn 2
>
> $ mpirun -np 1 ./simple_spawn 2
>
>
>
> The "simple_spawn" program:
>
>
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char ** argv)
> {
>     int processesToRun;
>     MPI_Comm parentcomm, intercomm;
>     MPI_Info info;
>     int rank, size, hostName_len;
>     char hostName[200];
>
>     MPI_Init( &argc, &argv );
>     MPI_Comm_get_parent( &parentcomm );
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(hostName, &hostName_len);
>
>     if (parentcomm == MPI_COMM_NULL) {
>         if (argc < 2) {
>             printf("Processes number needed!");
>             return 0;
>         }
>         processesToRun = atoi(argv[1]);
>         MPI_Info_create( &info );
>         MPI_Info_set( info, "hostfile", "./hostfile" );
>         MPI_Info_set( info, "map_by", "node" );
>
>         MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
>                         MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE );
>         printf("I'm the parent.\n");
>     } else {
>         printf("I'm the spawned h: %s  r/s: %i/%i.\n", hostName, rank, size );
>     }
>
>     fflush(stdout);
>     MPI_Finalize();
>     return 0;
> }
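>
> (For completeness, a typical way to build it would be with the Open MPI
> compiler wrapper; the output name below is chosen to match the run
> commands above:)
>
>     $ mpicc simple_spawn.c -o simple_spawn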
>
>
>
> I came from OMPI 4.0.1. In that version it works... with some
> inconsistencies, I'm afraid. That's why I decided to upgrade to OMPI 4.0.4.
>
> I tried several versions with no luck. Is there perhaps an intrinsic problem
> with the OMPI dynamic process allocation functionality?
>
> Any help will be very much appreciated. Best regards.
>
>
>
> Martín