Re: [OMPI users] stdin issue with openmpi/2.0.0

Jingchao Zhang Tue, 23 Aug 2016 15:29:50 -0700

Everything got stuck right after MPI_Init. For a test job with 2 nodes and 10 
cores per node, I got the following:



$ mpirun ./a.out < test.in
Rank 2 has cleared MPI_Init
Rank 4 has cleared MPI_Init
Rank 7 has cleared MPI_Init
Rank 8 has cleared MPI_Init
Rank 0 has cleared MPI_Init
Rank 5 has cleared MPI_Init
Rank 6 has cleared MPI_Init
Rank 9 has cleared MPI_Init
Rank 1 has cleared MPI_Init
Rank 16 has cleared MPI_Init
Rank 19 has cleared MPI_Init
Rank 10 has cleared MPI_Init
Rank 11 has cleared MPI_Init
Rank 12 has cleared MPI_Init
Rank 13 has cleared MPI_Init
Rank 14 has cleared MPI_Init
Rank 15 has cleared MPI_Init
Rank 17 has cleared MPI_Init
Rank 18 has cleared MPI_Init
Rank 3 has cleared MPI_Init

then it just hung.

--Jingchao


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org 
<r...@open-mpi.org>
Sent: Tuesday, August 23, 2016 4:03:07 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

The IO forwarding messages all flow over the Ethernet, so the type of fabric is 
irrelevant. The number of procs involved would definitely have an impact, but 
that might not be due to the IO forwarding subsystem. We know we have flow 
control issues with collectives like Bcast that don’t have built-in 
synchronization points. How many reads were you able to do before it hung?

I was running it on my little test setup (2 nodes, using only a few procs), but 
I’ll try scaling up and see what happens. I’ll also try introducing some forced 
“syncs” on the Bcast and see if that solves the issue.

Ralph

On Aug 23, 2016, at 2:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:

Hi Ralph,

I tested v2.0.1rc1 with your code, but it has the same issue. I also installed 
v2.0.1rc1 on a different cluster, which has Mellanox QDR InfiniBand, and got the 
same result. For the tests you have done, how many cores and nodes did you use? 
I can trigger the problem by using multiple nodes with more than 10 cores per 
node.

Thank you for looking into this.

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org 
<r...@open-mpi.org>
Sent: Monday, August 22, 2016 10:23:42 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

FWIW: I just tested forwarding up to 100MBytes via stdin using the simple test 
shown below with OMPI v2.0.1rc1, and it worked fine. So I’d suggest upgrading 
when the official release comes out, or going ahead and at least testing 
2.0.1rc1 on your machine. Or you can test this program with some input file and 
let me know if it works for you.

Ralph

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <unistd.h>
#include <mpi.h>

#define ORTE_IOF_BASE_MSG_MAX   2048

int main(int argc, char *argv[])
{
    int i, rank, size, next, prev, tag = 201;
    int pos, msgsize, nbytes;
    bool done;
    char *msg;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);

    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;
    msg = malloc(ORTE_IOF_BASE_MSG_MAX);
    pos = 0;
    nbytes = 0;

    if (0 == rank) {
        while (0 != (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
            fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
            if (msgsize > 0) {
                MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
            }
            ++pos;
            nbytes += msgsize;
        }
        fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
        memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
        MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    } else {
        while (1) {
            MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
            fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
            ++pos;
            done = true;
            for (i=0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
                if (0 != msg[i]) {
                    done = false;
                    break;
                }
            }
            if (done) {
                break;
            }
        }
        fprintf(stderr, "Rank %d: recv done\n", rank);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    fprintf(stderr, "Rank %d has completed bcast\n", rank);
    MPI_Finalize();
    return 0;
}
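
One caveat about the read loop in the test program above: read() returns -1 on 
error, and the condition `0 != msgsize` does not end the loop in that case, so a 
persistent read error would keep it spinning. A short read also leaves bytes 
from the previous blob in the tail of the buffer. The following is only a 
sketch of a more defensive reader; `read_blob` is a hypothetical helper name, 
not part of Open MPI or of the original test.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read up to cap bytes from fd into buf.
 * Returns the number of bytes read (> 0), 0 at end of file,
 * or -1 on an error other than EINTR.
 * The unused tail of buf is zeroed so stale bytes are never sent on. */
static ssize_t read_blob(int fd, char *buf, size_t cap)
{
    ssize_t n;

    do {
        n = read(fd, buf, cap);
    } while (n < 0 && errno == EINTR);   /* retry if interrupted by a signal */

    if (n >= 0 && (size_t)n < cap) {
        memset(buf + n, 0, cap - (size_t)n);
    }
    return n;
}
```

On rank 0 the loop condition would then become 
`(msgsize = read_blob(0, msg, ORTE_IOF_BASE_MSG_MAX)) > 0`, so that both end of 
file and a read error lead to the termination blob being sent.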



On Aug 22, 2016, at 3:40 PM, Jingchao Zhang <zh...@unl.edu> wrote:

This might be a thin argument, but we have had many users running mpirun this 
way for years with no problems until this recent upgrade. And some home-brewed 
MPI codes do not even have a standard way to read input files. Last time I 
checked, the Open MPI manual still claims it supports stdin 
(https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). Maybe I missed it, 
but the v2.0 release notes did not mention any changes to the behavior of stdin 
either.

We can tell our users to run mpirun in the suggested way, but I do hope someone 
can look into the issue and fix it.

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org 
<r...@open-mpi.org>
Sent: Monday, August 22, 2016 3:04:50 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

Well, I can try to find time to take a look. However, I will reiterate what 
Jeff H said - it is very unwise to rely on IO forwarding. Much better to just 
directly read the file unless that file is simply unavailable on the node where 
rank=0 is running.
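
The advice above amounts to opening the input file by name and falling back to 
stdin only when no file is given. A minimal sketch in plain C follows; 
`open_input` is a hypothetical helper (not an Open MPI or LAMMPS function), and 
in an MPI program only rank 0 would call it before broadcasting what it reads.

```c
#include <stdio.h>

/* Hypothetical helper: open the named input file for reading, or fall
 * back to stdin when no path is given.
 * Returns NULL if the named file cannot be opened. */
static FILE *open_input(const char *path)
{
    if (path == NULL || path[0] == '\0') {
        return stdin;            /* depends on mpirun forwarding stdin */
    }
    return fopen(path, "r");     /* read directly from the filesystem */
}
```

Passing the file name on the command line and opening it this way avoids the IO 
forwarding path entirely. The caller should fclose() the result only when it is 
not stdin.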

On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:

Here you can find the source code for lammps input 
https://github.com/lammps/lammps/blob/r13864/src/input.cpp

Based on the gdb output, rank 0 is stuck at line 167
if (fgets(&line[m],maxline-m,infile) == NULL)
and the remaining ranks are stuck at line 203
MPI_Bcast(&n,1,MPI_INT,0,world);

So rank 0 is possibly hanging in the fgets() call.

Here is the full backtrace information:

$ cat master.backtrace worker.backtrace
#0  0x0000003c37cdb68d in read () from /lib64/libc.so.6
#1  0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
#4  0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
#5  0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
#6  0x00000000005d4236 in main () at ../main.cpp:31
#0  0x00002b1635d2ace2 in poll_dispatch ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#1  0x00002b1635d1fa71 in opal_libevent2022_event_base_loop ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#2  0x00002b1635ce4634 in opal_progress ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
#3  0x00002b16351b8fad in ompi_request_default_wait ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#4  0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#5  0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#6  0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
#7  0x00002b16351cb4fb in PMPI_Bcast ()
   from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
#8  0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
#9  0x00000000005d4236 in main () at ../main.cpp:31

Thanks,

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org 
<r...@open-mpi.org>
Sent: Monday, August 22, 2016 2:17:10 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

Hmmm...perhaps we can break this out a bit? The stdin will be going to your 
rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?

Can you first verify that the input is being correctly delivered to rank=0? 
This will help us isolate if the problem is in the IO forwarding, or in the 
subsequent Bcast.
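
One way to run that check is to have rank 0 drain stdin and report how many 
bytes it saw together with a simple hash, then compare those numbers with the 
same summary computed directly on the original file. The sketch below assumes 
that approach; `stream_summary` is a hypothetical helper, and the choice of a 
32-bit FNV-1a hash is only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read a stream to EOF, counting bytes and
 * computing a 32-bit FNV-1a hash of the contents.
 * Returns 0 on success, -1 on a read error. */
static int stream_summary(FILE *in, unsigned long long *nbytes, uint32_t *hash)
{
    unsigned char buf[4096];
    size_t n, i;
    uint32_t h = 2166136261u;        /* FNV-1a offset basis */
    unsigned long long total = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 16777619u;          /* FNV prime */
        }
        total += n;
    }
    if (ferror(in)) {
        return -1;
    }
    *nbytes = total;
    *hash = h;
    return 0;
}
```

If rank 0 calls `stream_summary(stdin, ...)` and never returns, or returns a 
byte count smaller than the file size, the problem is in the IO forwarding 
rather than in the Bcast.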

On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:

Hi all,

We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both builds behave 
oddly when trying to read from standard input.

For example, if we start the application LAMMPS across 4 nodes, 16 cores per 
node, connected by Intel QDR InfiniBand, mpirun works fine the first time, but 
thereafter always gets stuck within a few seconds.
Command:
mpirun ./lmp_ompi_g++ < in.snr
in.snr is the LAMMPS input file. The compiler is gcc/6.1.

Instead, if we use
mpirun ./lmp_ompi_g++ -in in.snr
it works 100%.

Some odd behaviors we have gathered so far:
1. For a 1-node job, stdin always works.
2. For multiple nodes, stdin works unreliably when the number of cores per node 
is relatively small. For example, with 2/3/4 nodes and 8 cores per node, mpirun 
works most of the time. But with >8 cores per node, mpirun works the first 
time, then always gets stuck. There seems to be a magic number at which it stops 
working.
3. We tested Quantum ESPRESSO with the intel/13 compiler and had the same issue.

We used gdb to debug and found that when mpirun was stuck, the rest of the 
processes were all waiting on an MPI broadcast from the master process. The 
LAMMPS binary, input file and gdb core files (example.tar.bz2) can be downloaded 
from this link https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc

Extra information:
1. Job scheduler is slurm.
2. configure setup:

./configure     --prefix=$PREFIX \
                --with-hwloc=internal \
                --enable-mpirun-prefix-by-default \
                --with-slurm \
                --with-verbs \
                --with-psm \
                --disable-openib-connectx-xrc \
                --with-knem=/opt/knem-1.1.2.90mlnx1 \
                --with-cma
3. openmpi-mca-params.conf file
orte_hetero_nodes=1
hwloc_base_binding_policy=core
rmaps_base_mapping_policy=core
opal_cuda_support=0
btl_openib_use_eager_rdma=0
btl_openib_max_eager_rdma=0
btl_openib_flags=1

Thanks,
Jingchao

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
