Re: [OMPI users] stdin issue with openmpi/2.0.0

r...@open-mpi.org Tue, 23 Aug 2016 17:45:35 -0700

Very strange. I cannot reproduce it: I’m able to run any number of nodes and 
procs, pushing over 100 MBytes through without any problem.


Which leads me to suspect that the issue here is with the tty interface. Can 
you tell me what shell and OS you are running?
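
One way to check what kind of stdin a process actually receives is to classify file descriptor 0 at startup. The sketch below is only an illustration: `fd_kind` is a hypothetical helper, not part of Open MPI, and it assumes a POSIX system. What rank 0 reports under mpirun depends on how the runtime forwards stdin, so treat the result as a diagnostic hint rather than a specification.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Classify a file descriptor as "tty", "pipe", "file" or "other".
 * Returns "invalid" if the descriptor cannot be inspected.
 * Hypothetical helper for diagnostics only. */
static const char *fd_kind(int fd)
{
    struct stat st;

    if (isatty(fd)) {
        return "tty";
    }
    if (fstat(fd, &st) != 0) {
        return "invalid";
    }
    if (S_ISFIFO(st.st_mode)) {
        return "pipe";
    }
    if (S_ISREG(st.st_mode)) {
        return "file";
    }
    return "other";
}
```

Printing `fd_kind(0)` to stderr right after MPI_Init on rank 0 would show whether the application sees a terminal, a pipe, or a regular file.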


> On Aug 23, 2016, at 3:25 PM, Jingchao Zhang <zh...@unl.edu> wrote:
> 
> Everything got stuck right after MPI_Init. For a test job with 2 nodes and 
> 10 cores per node, I got the following:
> 
> $ mpirun ./a.out < test.in
> Rank 2 has cleared MPI_Init
> Rank 4 has cleared MPI_Init
> Rank 7 has cleared MPI_Init
> Rank 8 has cleared MPI_Init
> Rank 0 has cleared MPI_Init
> Rank 5 has cleared MPI_Init
> Rank 6 has cleared MPI_Init
> Rank 9 has cleared MPI_Init
> Rank 1 has cleared MPI_Init
> Rank 16 has cleared MPI_Init
> Rank 19 has cleared MPI_Init
> Rank 10 has cleared MPI_Init
> Rank 11 has cleared MPI_Init
> Rank 12 has cleared MPI_Init
> Rank 13 has cleared MPI_Init
> Rank 14 has cleared MPI_Init
> Rank 15 has cleared MPI_Init
> Rank 17 has cleared MPI_Init
> Rank 18 has cleared MPI_Init
> Rank 3 has cleared MPI_Init
> 
> then it just hung.
> 
> --Jingchao
> 
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org
> Sent: Tuesday, August 23, 2016 4:03:07 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>  
> The IO forwarding messages all flow over the Ethernet, so the type of fabric 
> is irrelevant. The number of procs involved would definitely have an impact, 
> but that might not be due to the IO forwarding subsystem. We know we have 
> flow control issues with collectives like Bcast that don’t have built-in 
> synchronization points. How many reads were you able to do before it hung?
> 
> I was running it on my little test setup (2 nodes, using only a few procs), 
> but I’ll try scaling up and see what happens. I’ll also try introducing some 
> forced “syncs” on the Bcast and see if that solves the issue.
> 
> Ralph
> 
>> On Aug 23, 2016, at 2:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>> 
>> Hi Ralph,
>> 
>> I tested v2.0.1rc1 with your code, but it has the same issue. I also 
>> installed v2.0.1rc1 on a different cluster, which has Mellanox QDR 
>> InfiniBand, and got the same result. For the tests you have done, how many 
>> cores and nodes did you use? I can trigger the problem by using multiple 
>> nodes with more than 10 cores per node.
>> 
>> Thank you for looking into this.
>> 
>> Dr. Jingchao Zhang
>> Holland Computing Center
>> University of Nebraska-Lincoln
>> 402-472-6400
>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org
>> Sent: Monday, August 22, 2016 10:23:42 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>  
>> FWIW: I just tested forwarding up to 100MBytes via stdin using the simple 
>> test shown below with OMPI v2.0.1rc1, and it worked fine. So I’d suggest 
>> upgrading when the official release comes out, or going ahead and at least 
>> testing 2.0.1rc1 on your machine. Or you can test this program with some 
>> input file and let me know if it works for you.
>> 
>> Ralph
>> 
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <stdbool.h>
>> #include <unistd.h>
>> #include <mpi.h>
>> 
>> #define ORTE_IOF_BASE_MSG_MAX   2048
>> 
>> int main(int argc, char *argv[])
>> {
>>     int i, rank, size, next, prev, tag = 201;
>>     int pos, msgsize, nbytes;
>>     bool done;
>>     char *msg;
>> 
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>> 
>>     fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);
>> 
>>     next = (rank + 1) % size;
>>     prev = (rank + size - 1) % size;
>>     msg = malloc(ORTE_IOF_BASE_MSG_MAX);
>>     pos = 0;
>>     nbytes = 0;
>> 
>>     if (0 == rank) {
>>         /* read() returns 0 at EOF and -1 on error; stop on either */
>>         while (0 < (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
>>             fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
>>             MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>             ++pos;
>>             nbytes += msgsize;
>>         }
>>         fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
>>         memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
>>         MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>         MPI_Barrier(MPI_COMM_WORLD);
>>     } else {
>>         while (1) {
>>             MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>             fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
>>             ++pos;
>>             done = true;
>>             for (i=0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
>>                 if (0 != msg[i]) {
>>                     done = false;
>>                     break;
>>                 }
>>             }
>>             if (done) {
>>                 break;
>>             }
>>         }
>>         fprintf(stderr, "Rank %d: recv done\n", rank);
>>         MPI_Barrier(MPI_COMM_WORLD);
>>     }
>> 
>>     fprintf(stderr, "Rank %d has completed bcast\n", rank);
>>     MPI_Finalize();
>>     return 0;
>> }
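
A `read()` call can return a short count, or -1 on error, so a loop that feeds stdin into MPI_Bcast benefits from a helper that keeps those cases apart. The following is a minimal sketch under that assumption. `read_chunk` is a hypothetical name, not something from the test program or from Open MPI, and the MPI side is left out so the helper can be exercised on its own.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `cap` bytes from fd, retrying on EINTR and on short reads.
 * Returns the number of bytes read (0 means EOF before any data),
 * or -1 on a read error. Hypothetical helper, not from the thread. */
static ssize_t read_chunk(int fd, char *buf, size_t cap)
{
    size_t got = 0;

    while (got < cap) {
        ssize_t n = read(fd, buf + got, cap - got);
        if (n > 0) {
            got += (size_t)n;      /* partial data, keep filling */
        } else if (n == 0) {
            break;                 /* EOF */
        } else if (errno != EINTR) {
            return -1;             /* real error, give up */
        }
    }
    return (ssize_t)got;
}
```

Rank 0 could call `read_chunk(0, msg, ORTE_IOF_BASE_MSG_MAX)` in place of the bare `read()` and leave the loop when the result is 0 or negative.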
>> 
>> 
>> 
>>> On Aug 22, 2016, at 3:40 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>> 
>>> This might be a thin argument, but we have had many users running mpirun 
>>> this way for years with no problem until this recent upgrade. And some 
>>> home-brewed MPI codes do not even have a standard way to read the input 
>>> files. Last time I checked, the Open MPI manual still claims it supports 
>>> stdin (https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). Maybe I 
>>> missed it, but the v2.0 release notes did not mention any changes to the 
>>> behavior of stdin either.
>>> 
>>> We can tell our users to run mpirun in the suggested way, but I do hope 
>>> someone can look into the issue and fix it.
>>> 
>>> Dr. Jingchao Zhang
>>> Holland Computing Center
>>> University of Nebraska-Lincoln
>>> 402-472-6400
>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org
>>> Sent: Monday, August 22, 2016 3:04:50 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>  
>>> Well, I can try to find time to take a look. However, I will reiterate what 
>>> Jeff H said - it is very unwise to rely on IO forwarding. Much better to 
>>> just directly read the file unless that file is simply unavailable on the 
>>> node where rank=0 is running.
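
Reading the file directly, as suggested here, can be sketched as choosing the input stream from the command line and falling back to stdin only when no file is named. This is an illustrative sketch only: `open_input` is a hypothetical helper, the `-in` flag simply mirrors the LAMMPS command line quoted elsewhere in this thread, and the MPI part (rank 0 reading and broadcasting the contents) is left out.

```c
#include <stdio.h>
#include <string.h>

/* Pick the input stream: the file named after a "-in" flag if present,
 * otherwise stdin. Returns NULL if the named file cannot be opened.
 * Hypothetical helper; only rank 0 would call it in an MPI program. */
static FILE *open_input(int argc, char *argv[])
{
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-in") == 0) {
            return fopen(argv[i + 1], "r");
        }
    }
    return stdin;
}
```

With this shape, `mpirun ./app -in in.snr` bypasses stdin forwarding entirely, while `mpirun ./app < in.snr` still works wherever forwarding is reliable.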
>>> 
>>>> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>> 
>>>> Here you can find the source code for LAMMPS input handling:
>>>> https://github.com/lammps/lammps/blob/r13864/src/input.cpp
>>>> Based on the gdb output, rank 0 is stuck at line 167
>>>> if (fgets(&line[m],maxline-m,infile) == NULL)
>>>> and the rest of the processes are stuck at line 203
>>>> MPI_Bcast(&n,1,MPI_INT,0,world);
>>>> 
>>>> So rank 0 possibly hangs in the fgets() function.
>>>> 
>>>> Here is the full backtrace information:
>>>> $ cat master.backtrace worker.backtrace
>>>> #0  0x0000003c37cdb68d in read () from /lib64/libc.so.6
>>>> #1  0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
>>>> #2  0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
>>>> #3  0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
>>>> #4  0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
>>>> #5  0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
>>>> #6  0x00000000005d4236 in main () at ../main.cpp:31
>>>> #0  0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>> #1  0x00002b1635d1fa71 in opal_libevent2022_event_base_loop ()
>>>>    from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>> #2  0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>> #3  0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>> #4  0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic ()
>>>>    from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>> #5  0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial ()
>>>>    from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>> #6  0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed ()
>>>>    from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
>>>> #7  0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>> #8  0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
>>>> #9  0x00000000005d4236 in main () at ../main.cpp:31
>>>> 
>>>> Thanks,
>>>> 
>>>> Dr. Jingchao Zhang
>>>> Holland Computing Center
>>>> University of Nebraska-Lincoln
>>>> 402-472-6400
>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org
>>>> Sent: Monday, August 22, 2016 2:17:10 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>  
>>>> Hmmm...perhaps we can break this out a bit? The stdin will be going to 
>>>> your rank=0 proc. It sounds like you have some subsequent step that calls 
>>>> MPI_Bcast?
>>>> 
>>>> Can you first verify that the input is being correctly delivered to 
>>>> rank=0? This will help us isolate if the problem is in the IO forwarding, 
>>>> or in the subsequent Bcast.
>>>> 
>>>>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both builds 
>>>>> behave oddly when trying to read from standard input.
>>>>> 
>>>>> For example, if we start the application LAMMPS across 4 nodes, each with 
>>>>> 16 cores, connected by Intel QDR InfiniBand, mpirun works fine the 
>>>>> 1st time, but on later runs it always gets stuck within a few seconds.
>>>>> Command:
>>>>> mpirun ./lmp_ompi_g++ < in.snr
>>>>> in.snr is the LAMMPS input file. The compiler is gcc/6.1.
>>>>> 
>>>>> Instead, if we use
>>>>> mpirun ./lmp_ompi_g++ -in in.snr
>>>>> it works 100%.
>>>>> 
>>>>> Some odd behaviors we have gathered so far:
>>>>> 1. For a 1-node job, stdin always works.
>>>>> 2. For multiple nodes, stdin works unreliably when the number of cores per 
>>>>> node is relatively small. For example, for 2/3/4 nodes with 8 cores 
>>>>> each, mpirun works most of the time. But with >8 cores per node, 
>>>>> mpirun works the 1st time, then always gets stuck. There seems to be a 
>>>>> magic number at which it stops working.
>>>>> 3. We tested Quantum ESPRESSO with compiler intel/13 and had the same 
>>>>> issue.
>>>>> 
>>>>> We used gdb to debug and found that when mpirun was stuck, the rest of the 
>>>>> processes were all waiting on an MPI broadcast from the master process. 
>>>>> The LAMMPS binary, input file and gdb core files (example.tar.bz2) can be 
>>>>> downloaded from this link:
>>>>> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>>>>> 
>>>>> Extra information:
>>>>> 1. The job scheduler is Slurm.
>>>>> 2. configure setup:
>>>>> ./configure     --prefix=$PREFIX \
>>>>>                 --with-hwloc=internal \
>>>>>                 --enable-mpirun-prefix-by-default \
>>>>>                 --with-slurm \
>>>>>                 --with-verbs \
>>>>>                 --with-psm \
>>>>>                 --disable-openib-connectx-xrc \
>>>>>                 --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>>>>                 --with-cma
>>>>> 3. openmpi-mca-params.conf file 
>>>>> orte_hetero_nodes=1
>>>>> hwloc_base_binding_policy=core
>>>>> rmaps_base_mapping_policy=core
>>>>> opal_cuda_support=0
>>>>> btl_openib_use_eager_rdma=0
>>>>> btl_openib_max_eager_rdma=0
>>>>> btl_openib_flags=1
>>>>> 
>>>>> Thanks,
>>>>> Jingchao 
>>>>> 
>>>>> Dr. Jingchao Zhang
>>>>> Holland Computing Center
>>>>> University of Nebraska-Lincoln
>>>>> 402-472-6400
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users@lists.open-mpi.org <mailto:users@lists.open-mpi.org>
>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users 
>>>>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/users>
