FWIW I could observe some memory leaks on both mpirun and MPI task 0 with the 
latest master branch.

So I guess mileage varies depending on available RAM and number of iterations.

Sent from my iPod

> On Mar 17, 2019, at 20:47, Riebs, Andy <andy.ri...@hpe.com> wrote:
> 
> Thomas, your test case is somewhat similar to a bash fork() bomb -- not the 
> same, but similar. After running one of your failing jobs, you might check to 
> see if the “out-of-memory” (“OOM”) killer has been invoked. If it has, that 
> can lead to unexpected consequences, such as what you’ve reported.
>  
> An easy way to check would be
> $ nodes="<comma-separated list of the job's nodes>"
> $ pdsh -w $nodes dmesg -T \| grep \"Out of memory\" 2>/dev/null
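>  
> (If pdsh isn't available, the same check can be run per node over ssh; this is
> just a sketch and assumes $nodes is the comma-separated list from above:)
> $ for n in ${nodes//,/ }; do ssh $n 'dmesg -T | grep "Out of memory"'; done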
>  
> Andy
>  
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Thomas Pak
> Sent: Saturday, March 16, 2019 4:14 PM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>  
> Dear Jeff,
>  
> I did find a way to circumvent this issue for my specific application by 
> spawning less frequently. However, I wanted to at least bring this issue to the
> attention of the OpenMPI community, as it can be reproduced with an
> alarmingly simple program.
>  
> Perhaps the users mailing list is not the ideal place for this. Would you
> recommend that I report this issue on the developers mailing list or open a
> GitHub issue?
>  
> Best wishes,
> Thomas Pak
>  
> On Mar 16 2019, at 7:40 pm, Jeff Hammond <jeff.scie...@gmail.com> wrote:
> Is there perhaps a different way to solve your problem that doesn’t spawn so 
> much as to hit this issue?
>  
> I’m not denying there’s an issue here, but in a world of finite human effort 
> and fallible software, sometimes it’s easiest to just avoid the bugs 
> altogether.
>  
> Jeff
>  
> On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak <thomas....@maths.ox.ac.uk> wrote:
> Dear all,
>  
> Does anyone have any idea what the problem could be here? This seems to be
> a persistent problem present in all currently supported OpenMPI releases and 
> indicates that there is a fundamental flaw in how OpenMPI handles dynamic 
> process creation.
>  
> Best wishes,
> Thomas Pak
>  
> From: "Thomas Pak" <thomas....@maths.ox.ac.uk>
> To: users@lists.open-mpi.org
> Sent: Friday, 7 December, 2018 17:51:29
> Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>  
> Dear all,
>  
> My MPI application spawns a large number of MPI processes using 
> MPI_Comm_spawn over its total lifetime. Unfortunately, this leads to problems
> in all currently supported OpenMPI versions
> (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in C 
> (included below) that spawns child processes using MPI_Comm_spawn in an 
> infinite loop, where each child process exits after writing a message to 
> stdout. This short program leads to the following issues:
>  
> In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
> program leads to a pipe leak where pipes keep accumulating over time until my 
> MPI application crashes because the maximum number of pipes has been reached.
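>  
> (For anyone who wants to watch the leak as it happens, counting the pipe file
> descriptors held by mpirun works; this is only a rough sketch and assumes pgrep
> picks out the right mpirun process:)
> $ watch -n 5 'ls -l /proc/$(pgrep -n mpirun)/fd | grep -c pipe'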
>  
> In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be 
> no pipe leak, but the program crashes with the following error message:
> PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
>  
> In version 4.0.0 (compiled from source), I have not been able to test this 
> issue very thoroughly because mpiexec ignores the --oversubscribe 
> command-line flag (as detailed in this GitHub issue 
> https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
> oversubscription of processor cores, which means that spawning additional 
> processes immediately results in an error because "not enough slots" are 
> available. A fix for this was proposed recently 
> (https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer 
> branch is being actively developed right now, I decided not to pursue it further.
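>  
> (One possible workaround, which I have not tested, would be to declare extra
> slots explicitly in a hostfile so that spawned processes have somewhere to land;
> here ./spawn_loop stands in for the compiled test program:)
> $ echo "localhost slots=16" > hostfile
> $ mpiexec --hostfile hostfile -n 1 ./spawn_loop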
>  
> I have found one e-mail thread on this mailing list about a similar problem 
> (https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In 
> this thread, Ralph Castain states that this is a known issue and suggests 
> that it was fixed in the then-upcoming v1.3.x release. However, version 1.3 is
> long out of support and the issue has reappeared, so that fix evidently did not
> resolve the underlying problem.
>  
> I have created a GitHub gist that contains the output from "ompi_info --all" 
> of all the OpenMPI installations mentioned here, as well as the config.log 
> files for the OpenMPI installations that I compiled from source: 
> https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
>  
> I have also attached the code for the short program that demonstrates these 
> issues. For good measure, I have included it directly here as well:
>  
> """
> #include <stdio.h>
> #include <mpi.h>
>  
> int main(int argc, char *argv[]) {
>  
>     // Initialize MPI
>     MPI_Init(NULL, NULL);
>  
>     // Get the parent communicator
>     MPI_Comm parent;
>     MPI_Comm_get_parent(&parent);
>  
>     // If the process was not spawned
>     if (parent == MPI_COMM_NULL) {
>  
>         puts("I was not spawned!");
>  
>         // Spawn child processes in an infinite loop
>         char *cmd = argv[0];
>         char **cmd_argv = MPI_ARGV_NULL;
>         int maxprocs = 1;
>         MPI_Info info = MPI_INFO_NULL;
>         int root = 0;
>         MPI_Comm comm = MPI_COMM_SELF;
>         MPI_Comm intercomm;
>         int *array_of_errcodes = MPI_ERRCODES_IGNORE;
>  
>         for (;;) {
>             MPI_Comm_spawn(cmd, cmd_argv, maxprocs, info, root, comm,
>                            &intercomm, array_of_errcodes);
>  
>             MPI_Comm_disconnect(&intercomm);
>         }
>  
>     // If the process was spawned
>     } else {
>  
>         puts("I was spawned!");
>  
>         MPI_Comm_disconnect(&parent);
>     }
>  
>     // Finalize
>     MPI_Finalize();
>  
>     return 0;
> }
> """
>  
> Thanks in advance and best wishes,
> Thomas Pak
>  
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
