I think the root cause was that he expected the negative integer resulting
from the reduction to be the exit code of the application, and as I
explained in my prior email that's not how exit() works.
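
As a minimal sketch of why a negative value does not survive as an exit code: on POSIX systems a parent process only sees the low 8 bits of the value passed to exit(). The helper names below (`visible_exit_status`, `show_exit_status`) are hypothetical and only illustrate the wrap-around; they are not part of MPI or Open MPI.

```c
#include <stdio.h>

/* Status a POSIX parent would read via WEXITSTATUS() after a child calls
 * exit(code): only the low 8 bits are kept. The cast to unsigned makes the
 * wrap-around well defined for negative inputs. */
static int visible_exit_status(int code)
{
    return (int)((unsigned int)code & 0xFFu);
}

/* Hypothetical helper: print the status a shell's $? would show. */
static void show_exit_status(int code)
{
    printf("exit(%d) is reported as %d\n", code, visible_exit_status(code));
}
```

Under this rule exit(-8) and exit(-3) would be reported as 248 and 253, which is consistent with two of the values mentioned later in the thread.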

The exit() issue aside, MPI_Abort seems to be the right function for this
usage.
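
One possible way to keep custom (possibly negative) error numbers usable with MPI_Abort is to fold them into a small positive range first. The sketch below assumes a hypothetical helper `portable_error_code` and an arbitrary mapping onto 1..254 (below the 255 limit mentioned in the quoted reply); the MPI call appears only in a comment so the block compiles without MPI.

```c
#include <stdlib.h>

/* Hypothetical mapping: fold an application error code (possibly negative)
 * into 1..254 so that it can survive as a process exit status.
 * A code of 0 (success) stays 0. */
static int portable_error_code(int custom_code)
{
    if (custom_code == 0)
        return 0;
    /* Widen to long long before taking the magnitude so that the most
     * negative int cannot overflow. */
    long long magnitude = llabs((long long)custom_code);
    return (int)((magnitude - 1) % 254) + 1;
}

/* Intended use (not compiled here, requires <mpi.h>):
 *     MPI_Abort(MPI_COMM_WORLD, portable_error_code(err));
 */
```

Distinct codes can collide once their magnitude exceeds 254, so this only preserves the original number for small codes; the full value would still have to come from the logfile.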

  George.


On Wed, Jul 19, 2023 at 11:08 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> MPI_Allreduce should work just fine, even with negative numbers.  If you
> are seeing something different, can you provide a small reproducer program
> that shows the problem?  We can dig deeper into it if we can reproduce the
> problem.
>
> mpirun's exit status can't distinguish between MPI processes that call
> MPI_Finalize and then return a non-zero exit status and those that invoke
> MPI_Abort.  But if you have one process that invokes MPI_Abort with an exit
> status <255, it should be reflected in mpirun's exit status.  For example:
>
> $ cat abort.c
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int i, rank, size;
>
>     MPI_Init(NULL, NULL);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (rank == size - 1) {
>         int err_code = 79;
>         fprintf(stderr, "I am rank %d and am aborting with error code %d\n",
>                 rank, err_code);
>         MPI_Abort(MPI_COMM_WORLD, err_code);
>     }
>
>     fprintf(stderr, "I am rank %d and am exiting with 0\n", rank);
>     MPI_Finalize();
>     return 0;
> }
>
> $ mpicc abort.c -o abort
>
> $ mpirun --host mpi004:2,mpi005:2 -np 4 ./abort
> I am rank 0 and am exiting with 0
> I am rank 1 and am exiting with 0
> I am rank 2 and am exiting with 0
> I am rank 3 and am aborting with error code 79
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 79.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> $ echo $?
> 79
>
> ------------------------------
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Alexander
> Stadik via users <users@lists.open-mpi.org>
> *Sent:* Wednesday, July 19, 2023 12:45 AM
> *To:* George Bosilca <bosi...@icl.utk.edu>; Open MPI Users <
> users@lists.open-mpi.org>
> *Cc:* Alexander Stadik <alexander.sta...@essteyr.com>
> *Subject:* Re: [OMPI users] [EXT] Re: Error handling
>
> Hey George,
>
> I said "random" only because I do not see the pattern behind it. When I do
> an allreduce with MIN and return a negative number, I usually get 248, 253,
> 11 or 6 as the exit code, which suggests the number comes purely from the
> MPI side.
>
> The problem with MPI_Abort is that it shows the correct number in its
> output in the logfile, but it does not communicate that value to the other
> processes or forward it as the exit code. So one always sees these "random"
> values there as well.
>
> With positive numbers in range it seems to work, so my question is how this
> works and how to do it properly. Is there a way to let MPI_Abort
> communicate the value as the exit code?
> Why do negative numbers not work, or does one simply always have to use
> positive numbers? I would prefer MPI_Abort because it seems safer.
>
> BR Alex
>
>
> ------------------------------
> *From:* George Bosilca <bosi...@icl.utk.edu>
> *Sent:* Tuesday, July 18, 2023 18:47
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* Alexander Stadik <alexander.sta...@essteyr.com>
> *Subject:* [EXT] Re: [OMPI users] Error handling
>
> Alex,
>
> How are your values "random" if you provide correct values? Even for
> negative values you could use MIN to pick one value and return it. What is
> the problem with `MPI_Abort`? It does seem to do what you want.
>
>   George.
>
>
> On Tue, Jul 18, 2023 at 4:38 AM Alexander Stadik via users <
> users@lists.open-mpi.org> wrote:
>
> Hey everyone,
>
> I have been working with CUDA-aware Open MPI for quite some time, and a
> while back I developed a small exception-handling framework covering MPI
> and CUDA exceptions.
> Currently I use MPI_Abort with custom error numbers to terminate
> everything gracefully, which works well as long as one just reads the
> logfile after a crash.
>
> Now I was wondering how one can handle return / exit codes properly
> between processes, since we would like to filter non-zero exits by return
> code.
>
> One way (in my case) is a simple Allreduce followed by exit instead of
> Abort. The problem is that the resulting exit values always look "random"
> (I was using negative codes); it only seems to work correctly when I use
> MPI error codes, and those are of limited use to us.
>
> Any suggestions on how to do this / how it can work properly?
>
> BR Alex
>
