I think the root cause was that he expected the negative integer resulting from the reduction to be the exit code of the application, and as I explained in my prior email that's not how exit() works.
The exit() issue aside, MPI_Abort seems to be the right function for this usage.

George.

On Wed, Jul 19, 2023 at 11:08 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> MPI_Allreduce should work just fine, even with negative numbers. If you
> are seeing something different, can you provide a small reproducer program
> that shows the problem? We can dig deeper into it if we can reproduce the
> problem.
>
> mpirun's exit status can't distinguish between MPI processes that call
> MPI_Finalize and then return a non-zero exit status and those that invoke
> MPI_Abort. But if you have 1 process that invokes MPI_Abort with an exit
> status <255, it should be reflected in mpirun's exit status. For example:
>
> $ cat abort.c
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int i, rank, size;
>
>     MPI_Init(NULL, NULL);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (rank == size - 1) {
>         int err_code = 79;
>         fprintf(stderr, "I am rank %d and am aborting with error code %d\n",
>                 rank, err_code);
>         MPI_Abort(MPI_COMM_WORLD, err_code);
>     }
>
>     fprintf(stderr, "I am rank %d and am exiting with 0\n", rank);
>     MPI_Finalize();
>     return 0;
> }
>
> $ mpicc abort.c -o abort
>
> $ mpirun --host mpi004:2,mpi005:2 -np 4 ./abort
> I am rank 0 and am exiting with 0
> I am rank 1 and am exiting with 0
> I am rank 2 and am exiting with 0
> I am rank 3 and am aborting with error code 79
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 79.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> $ echo $?
> 79
>
> ------------------------------
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Alexander Stadik via users <users@lists.open-mpi.org>
> *Sent:* Wednesday, July 19, 2023 12:45 AM
> *To:* George Bosilca <bosi...@icl.utk.edu>; Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Alexander Stadik <alexander.sta...@essteyr.com>
> *Subject:* Re: [OMPI users] [EXT] Re: Error handling
>
> Hey George,
>
> I said "random" only because I do not see the method behind it. When I do
> an allreduce with MIN and return a negative number, I usually get either
> 248, 253, 11 or 6, meaning that is purely a number from the MPI side.
>
> The problem with MPI_Abort is that it shows the correct number in its
> output in the logfile, but it does not communicate its value to other
> processes or forward its value to exit. So one also always sees these
> "random" values.
>
> When using positive numbers in range it seems to work, so my question was
> how it works, and how one can do it. Is there a way to let MPI_Abort
> communicate the value as exit code?
> Why do negative numbers not work, or does one simply have to always use
> positive numbers? I would prefer Abort because it seems safer.
>
> BR Alex
>
> ------------------------------
> *From:* George Bosilca <bosi...@icl.utk.edu>
> *Sent:* Tuesday, July 18, 2023 18:47
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Alexander Stadik <alexander.sta...@essteyr.com>
> *Subject:* [EXT] Re: [OMPI users] Error handling
>
> Alex,
>
> How are your values "random" if you provide correct values? Even for
> negative values you could use MIN to pick one value and return it. What is
> the problem with `MPI_Abort`? It does seem to do what you want.
>
> George.
>
> On Tue, Jul 18, 2023 at 4:38 AM Alexander Stadik via users <users@lists.open-mpi.org> wrote:
>
> Hey everyone,
>
> I have been working with CUDA-aware Open MPI for some time now, and a
> while back I developed a small exception handling framework covering MPI
> and CUDA exceptions.
> Currently I am using MPI_Abort with custom error numbers to terminate
> everything elegantly, which works well: in case of a crash I just read
> the logfile.
>
> Now I was wondering how one can handle return / exit codes properly
> between processes, since we would like to filter non-zero exits by return
> code.
>
> One way is a simple Allreduce (in my case) + exit instead of Abort. But
> the problem seems to be that the values are always "random" (since I was
> using negative codes); only with MPI error codes does it seem to work
> correctly. But usage of that is limited.
>
> Any suggestions on how to do this / how it can work properly?
>
> BR Alex
>
> DI Alexander Stadik
> Head of Large Scale Solutions
> ESS Engineering Software Steyr GmbH