I’m sorry, but something is simply very wrong here. Are you sure your LD_LIBRARY_PATH points at the correct installation? Perhaps add a “BOO” or something at the front of the output message to ensure we are using the correct plugin?
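A minimal sketch of that marker idea, applied to the opal_output() call from the debug patch quoted below (the "BOO" string is arbitrary - any unmistakable prefix works):

    opal_output(0,
                "BOO %s iof:hnp pushing fd %d for process %s",
                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                fd, ORTE_NAME_PRINT(dst_name));

If the rebuilt mpirun then prints the messages without the prefix, the run is resolving some other copy of the library; comparing the output of "ompi_info | grep iof" from the intended install against the environment on the compute nodes should confirm which tree is actually being picked up.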
This looks to me like you must be picking up a stale library somewhere.

> On Aug 29, 2016, at 10:29 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>
> Hi Ralph,
>
> I used the tarball from Aug 26 and added the patch. Tested with 2 nodes, 10 cores/node. Please see the results below:
>
> $ mpirun ./a.out < test.in
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 35 for process [[43954,1],0]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 41 for process [[43954,1],0]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 43 for process [[43954,1],0]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 37 for process [[43954,1],1]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 46 for process [[43954,1],1]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 49 for process [[43954,1],1]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 38 for process [[43954,1],2]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 50 for process [[43954,1],2]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 52 for process [[43954,1],2]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 42 for process [[43954,1],3]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 53 for process [[43954,1],3]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 55 for process [[43954,1],3]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 45 for process [[43954,1],4]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 56 for process [[43954,1],4]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 58 for process [[43954,1],4]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 47 for process [[43954,1],5]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 59 for process [[43954,1],5]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 61 for process [[43954,1],5]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 57 for process [[43954,1],6]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 64 for process [[43954,1],6]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 66 for process [[43954,1],6]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 62 for process [[43954,1],7]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 68 for process [[43954,1],7]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 70 for process [[43954,1],7]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 65 for process [[43954,1],8]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 72 for process [[43954,1],8]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 74 for process [[43954,1],8]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 75 for process [[43954,1],9]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 79 for process [[43954,1],9]
> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 81 for process [[43954,1],9]
> Rank 5 has cleared MPI_Init
> Rank 9 has cleared MPI_Init
> Rank 1 has cleared MPI_Init
> Rank 2 has cleared MPI_Init
> Rank 3 has cleared MPI_Init
> Rank 4 has cleared MPI_Init
> Rank 8 has cleared MPI_Init
> Rank 0 has cleared MPI_Init
> Rank 6 has cleared MPI_Init
> Rank 7 has cleared MPI_Init
> Rank 14 has cleared MPI_Init
> Rank 15 has cleared MPI_Init
> Rank 16 has cleared MPI_Init
> Rank 18 has cleared MPI_Init
> Rank 10 has cleared MPI_Init
> Rank 11 has cleared MPI_Init
> Rank 12 has cleared MPI_Init
> Rank 13 has cleared MPI_Init
> Rank 17 has cleared MPI_Init
> Rank 19 has cleared MPI_Init
>
> Thanks,
>
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
>
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Saturday, August 27, 2016 12:31:53 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>
> I am finding this impossible to replicate, so something odd must be going on. Can you please (a) pull down the latest v2.0.1 nightly tarball, and (b) add this patch to it?
>
> diff --git a/orte/mca/iof/hnp/iof_hnp.c b/orte/mca/iof/hnp/iof_hnp.c
> old mode 100644
> new mode 100755
> index 512fcdb..362ff46
> --- a/orte/mca/iof/hnp/iof_hnp.c
> +++ b/orte/mca/iof/hnp/iof_hnp.c
> @@ -143,16 +143,17 @@ static int hnp_push(const orte_process_name_t* dst_name, orte_iof_tag_t src_tag,
>      int np, numdigs;
>      orte_ns_cmp_bitmask_t mask;
>
> +    opal_output(0,
> +                "%s iof:hnp pushing fd %d for process %s",
> +                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> +                fd, ORTE_NAME_PRINT(dst_name));
> +
>      /* don't do this if the dst vpid is invalid or the fd is negative! */
>      if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>          return ORTE_SUCCESS;
>      }
>
> -    OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
> -                         "%s iof:hnp pushing fd %d for process %s",
> -                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> -                         fd, ORTE_NAME_PRINT(dst_name)));
> -
>      if (!(src_tag & ORTE_IOF_STDIN)) {
>          /* set the file descriptor to non-blocking - do this before we setup
>           * and activate the read event in case it fires right away
>
> You can then run the test again without the "--mca iof_base_verbose 100" flag to reduce the chatter - this print statement will tell me what I need to know.
>
> Thanks!
> Ralph
>
>> On Aug 25, 2016, at 8:19 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>> The IOF fix PR for v2.0.1 was literally just merged a few minutes ago; it wasn't in last night's tarball.
>>
>>> On Aug 25, 2016, at 10:59 AM, r...@open-mpi.org wrote:
>>>
>>> ??? Weird - can you send me an updated output of that last test we ran?
>>>
>>>> On Aug 25, 2016, at 7:51 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>
>>>> Hi Ralph,
>>>>
>>>> I saw the pull request and did a test with v2.0.1rc1, but the problem persists. Any ideas?
>>>>
>>>> Thanks,
>>>>
>>>> Dr. Jingchao Zhang
>>>> Holland Computing Center
>>>> University of Nebraska-Lincoln
>>>> 402-472-6400
>>>>
>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>> Sent: Wednesday, August 24, 2016 1:27:28 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>
>>>> Bingo - found it, fix submitted and hope to get it into 2.0.1
>>>>
>>>> Thanks for the assist!
>>>> Ralph
>>>>
>>>>> On Aug 24, 2016, at 12:15 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>
>>>>> I configured v2.0.1rc1 with --enable-debug and ran the test with --mca iof_base_verbose 100. I also added -display-devel-map in case it provides some useful information.
>>>>>
>>>>> The test job has 2 nodes, each node 10 cores. Rank 0 and the mpirun command are on the same node.
>>>>> $ mpirun -display-devel-map --mca iof_base_verbose 100 ./a.out < test.in &> debug_info.txt
>>>>>
>>>>> The debug_info.txt is attached.
>>>>>
>>>>> Dr. Jingchao Zhang
>>>>> Holland Computing Center
>>>>> University of Nebraska-Lincoln
>>>>> 402-472-6400
>>>>>
>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>> Sent: Wednesday, August 24, 2016 12:14:26 PM
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>
>>>>> Afraid I can't replicate a problem at all, whether rank=0 is local or not. I'm also using bash, but on CentOS-7, so I suspect the OS is the difference.
>>>>>
>>>>> Can you configure OMPI with --enable-debug, and then run the test again with --mca iof_base_verbose 100? It will hopefully tell us something about why the IO subsystem is stuck.
>>>>>
>>>>>> On Aug 24, 2016, at 8:46 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>
>>>>>> Hi Ralph,
>>>>>>
>>>>>> For our tests, rank 0 is always on the same node as mpirun. I just tested mpirun with -nolocal and it still hangs.
>>>>>>
>>>>>> Information on shell and OS:
>>>>>> $ echo $0
>>>>>> -bash
>>>>>>
>>>>>> $ lsb_release -a
>>>>>> LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>>>>>> Distributor ID: Scientific
>>>>>> Description:    Scientific Linux release 6.8 (Carbon)
>>>>>> Release:        6.8
>>>>>> Codename:       Carbon
>>>>>>
>>>>>> $ uname -a
>>>>>> Linux login.crane.hcc.unl.edu 2.6.32-642.3.1.el6.x86_64 #1 SMP Tue Jul 12 11:25:51 CDT 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>> Dr. Jingchao Zhang
>>>>>> Holland Computing Center
>>>>>> University of Nebraska-Lincoln
>>>>>> 402-472-6400
>>>>>>
>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>> Sent: Tuesday, August 23, 2016 8:14:48 PM
>>>>>> To: Open MPI Users
>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>
>>>>>> Hmmm...that's a good point. Rank 0 and mpirun are always on the same node on my cluster. I'll give it a try.
>>>>>>
>>>>>> Jingchao: is rank 0 on the node with mpirun, or on a remote node?
>>>>>>
>>>>>>> On Aug 23, 2016, at 5:58 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>>>
>>>>>>> Ralph,
>>>>>>>
>>>>>>> did you run task 0 and mpirun on different nodes?
>>>>>>>
>>>>>>> i observed some random hangs, though i cannot blame openmpi 100% yet
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On 8/24/2016 9:41 AM, r...@open-mpi.org wrote:
>>>>>>>> Very strange. I cannot reproduce it, as I'm able to run any number of nodes and procs, pushing over 100 Mbytes through without any problem.
>>>>>>>>
>>>>>>>> Which leads me to suspect that the issue here is with the tty interface. Can you tell me what shell and OS you are running?
>>>>>>>>
>>>>>>>>> On Aug 23, 2016, at 3:25 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>
>>>>>>>>> Everything is stuck at MPI_Init. For a test job with 2 nodes and 10 cores per node, I got the following:
>>>>>>>>>
>>>>>>>>> $ mpirun ./a.out < test.in
>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>> Rank 4 has cleared MPI_Init
>>>>>>>>> Rank 7 has cleared MPI_Init
>>>>>>>>> Rank 8 has cleared MPI_Init
>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>> Rank 5 has cleared MPI_Init
>>>>>>>>> Rank 6 has cleared MPI_Init
>>>>>>>>> Rank 9 has cleared MPI_Init
>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>> Rank 16 has cleared MPI_Init
>>>>>>>>> Rank 19 has cleared MPI_Init
>>>>>>>>> Rank 10 has cleared MPI_Init
>>>>>>>>> Rank 11 has cleared MPI_Init
>>>>>>>>> Rank 12 has cleared MPI_Init
>>>>>>>>> Rank 13 has cleared MPI_Init
>>>>>>>>> Rank 14 has cleared MPI_Init
>>>>>>>>> Rank 15 has cleared MPI_Init
>>>>>>>>> Rank 17 has cleared MPI_Init
>>>>>>>>> Rank 18 has cleared MPI_Init
>>>>>>>>> Rank 3 has cleared MPI_Init
>>>>>>>>>
>>>>>>>>> then it just hung.
>>>>>>>>>
>>>>>>>>> --Jingchao
>>>>>>>>>
>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>> Holland Computing Center
>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>> 402-472-6400
>>>>>>>>>
>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>> Sent: Tuesday, August 23, 2016 4:03:07 PM
>>>>>>>>> To: Open MPI Users
>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>
>>>>>>>>> The IO forwarding messages all flow over the Ethernet, so the type of fabric is irrelevant. The number of procs involved would definitely have an impact, but that might not be due to the IO forwarding subsystem. We know we have flow control issues with collectives like Bcast that don't have built-in synchronization points. How many reads were you able to do before it hung?
>>>>>>>>>
>>>>>>>>> I was running it on my little test setup (2 nodes, using only a few procs), but I'll try scaling up and see what happens. I'll also try introducing some forced "syncs" on the Bcast and see if that solves the issue.
>>>>>>>>>
>>>>>>>>> Ralph
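A sketch of what such a forced sync could look like, written against the blob-counting loop of the test program quoted further down (msg, pos, and ORTE_IOF_BASE_MSG_MAX come from that program; SYNC_INTERVAL is an arbitrary illustrative constant, not anything from the thread). Every rank executes the same sequence, so the barriers line up across the job:

    #define SYNC_INTERVAL 32    /* arbitrary: drain after every 32 blobs */

    /* inside the per-blob loop, on every rank, after each broadcast */
    MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
    ++pos;
    if (0 == pos % SYNC_INTERVAL) {
        /* synchronization point: the root cannot push more blobs
         * until every rank has drained the previous batch */
        MPI_Barrier(MPI_COMM_WORLD);
    }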
>>>>>>>>>> On Aug 23, 2016, at 2:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Ralph,
>>>>>>>>>>
>>>>>>>>>> I tested v2.0.1rc1 with your code but it has the same issue. I also installed v2.0.1rc1 on a different cluster, which has Mellanox QDR Infiniband, and got the same result. For the tests you have done, how many cores and nodes did you use? I can trigger the problem by using multiple nodes with more than 10 cores each.
>>>>>>>>>>
>>>>>>>>>> Thank you for looking into this.
>>>>>>>>>>
>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>> Holland Computing Center
>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>> 402-472-6400
>>>>>>>>>>
>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>> Sent: Monday, August 22, 2016 10:23:42 PM
>>>>>>>>>> To: Open MPI Users
>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>
>>>>>>>>>> FWIW: I just tested forwarding up to 100 MBytes via stdin using the simple test shown below with OMPI v2.0.1rc1, and it worked fine. So I'd suggest upgrading when the official release comes out, or going ahead and at least testing 2.0.1rc1 on your machine. Or you can test this program with some input file and let me know if it works for you.
>>>>>>>>>>
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <string.h>
>>>>>>>>>> #include <stdbool.h>
>>>>>>>>>> #include <unistd.h>
>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>
>>>>>>>>>> #define ORTE_IOF_BASE_MSG_MAX 2048
>>>>>>>>>>
>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>> {
>>>>>>>>>>     int i, rank, size, next, prev, tag = 201;
>>>>>>>>>>     int pos, msgsize, nbytes;
>>>>>>>>>>     bool done;
>>>>>>>>>>     char *msg;
>>>>>>>>>>
>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>>>
>>>>>>>>>>     fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);
>>>>>>>>>>
>>>>>>>>>>     next = (rank + 1) % size;
>>>>>>>>>>     prev = (rank + size - 1) % size;
>>>>>>>>>>     msg = malloc(ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>     pos = 0;
>>>>>>>>>>     nbytes = 0;
>>>>>>>>>>
>>>>>>>>>>     if (0 == rank) {
>>>>>>>>>>         while (0 != (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
>>>>>>>>>>             fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
>>>>>>>>>>             if (msgsize > 0) {
>>>>>>>>>>                 MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>             }
>>>>>>>>>>             ++pos;
>>>>>>>>>>             nbytes += msgsize;
>>>>>>>>>>         }
>>>>>>>>>>         fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
>>>>>>>>>>         memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>         MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>     } else {
>>>>>>>>>>         while (1) {
>>>>>>>>>>             MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>             fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
>>>>>>>>>>             ++pos;
>>>>>>>>>>             done = true;
>>>>>>>>>>             for (i = 0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
>>>>>>>>>>                 if (0 != msg[i]) {
>>>>>>>>>>                     done = false;
>>>>>>>>>>                     break;
>>>>>>>>>>                 }
>>>>>>>>>>             }
>>>>>>>>>>             if (done) {
>>>>>>>>>>                 break;
>>>>>>>>>>             }
>>>>>>>>>>         }
>>>>>>>>>>         fprintf(stderr, "Rank %d: recv done\n", rank);
>>>>>>>>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     fprintf(stderr, "Rank %d has completed bcast\n", rank);
>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>     return 0;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>> On Aug 22, 2016, at 3:40 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> This might be a thin argument, but we have had many users running mpirun this way for years with no problem until this recent upgrade. And some home-brewed MPI codes do not even have a standard way to read input files. Last time I checked, the Open MPI manual still claims it supports stdin (https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). Maybe I missed it, but the v2.0 release notes did not mention any changes to the behavior of stdin either.
>>>>>>>>>>>
>>>>>>>>>>> We can tell our users to run mpirun in the suggested way, but I do hope someone can look into the issue and fix it.
>>>>>>>>>>>
>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>
>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>> Sent: Monday, August 22, 2016 3:04:50 PM
>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>
>>>>>>>>>>> Well, I can try to find time to take a look. However, I will reiterate what Jeff H said - it is very unwise to rely on IO forwarding. It is much better to just read the file directly, unless that file is simply unavailable on the node where rank=0 is running.
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Here you can find the source code for the lammps input: https://github.com/lammps/lammps/blob/r13864/src/input.cpp
>>>>>>>>>>>>
>>>>>>>>>>>> Based on the gdb output, rank 0 is stuck at line 167:
>>>>>>>>>>>> if (fgets(&line[m],maxline-m,infile) == NULL)
>>>>>>>>>>>> and the rest of the ranks are stuck at line 203:
>>>>>>>>>>>> MPI_Bcast(&n,1,MPI_INT,0,world);
>>>>>>>>>>>>
>>>>>>>>>>>> So rank 0 possibly hangs in the fgets() function.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is the whole backtrace information:
>>>>>>>>>>>> $ cat master.backtrace worker.backtrace
>>>>>>>>>>>> #0 0x0000003c37cdb68d in read () from /lib64/libc.so.6
>>>>>>>>>>>> #1 0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
>>>>>>>>>>>> #2 0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
>>>>>>>>>>>> #3 0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
>>>>>>>>>>>> #4 0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
>>>>>>>>>>>> #5 0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
>>>>>>>>>>>> #6 0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>> #0 0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>> #1 0x00002b1635d1fa71 in opal_libevent2022_event_base_loop () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>> #2 0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>> #3 0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>> #4 0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>> #5 0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>> #6 0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
>>>>>>>>>>>> #7 0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>> #8 0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
>>>>>>>>>>>> #9 0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>> 402-472-6400
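For readers without the LAMMPS source open, the shape of the loop those two backtraces point at is roughly the following. This is a simplified sketch of the input.cpp pattern, not the literal code (me is assumed here to hold the rank; line, m, maxline, infile, n, and world appear in the quoted snippets above). Rank 0 reads a line with fgets() and broadcasts its length; a length of 0 tells the other ranks to stop:

    /* simplified sketch of the LAMMPS-style input loop (not verbatim input.cpp) */
    while (1) {
        if (me == 0) {
            if (fgets(&line[m], maxline - m, infile) == NULL) n = 0;  /* ~line 167 */
            else n = strlen(line) + 1;
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, world);                          /* ~line 203 */
        if (n == 0) break;               /* EOF on rank 0 ends everyone */
        MPI_Bcast(line, n, MPI_CHAR, 0, world);
        /* ...parse and execute the command in line... */
    }

If stdin forwarding stalls, rank 0 blocks inside fgets()/read() while every other rank waits in the MPI_Bcast of n - exactly the master and worker backtraces shown above.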
>>>>>>>>>>>>
>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>> Sent: Monday, August 22, 2016 2:17:10 PM
>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>
>>>>>>>>>>>> Hmmm...perhaps we can break this out a bit? The stdin will be going to your rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?
>>>>>>>>>>>>
>>>>>>>>>>>> Can you first verify that the input is being correctly delivered to rank=0? This will help us isolate whether the problem is in the IO forwarding or in the subsequent Bcast.
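One minimal way to run that rank-0 check, as an illustrative sketch rather than code from the thread: read stdin on rank 0 only, with no Bcast in the loop, so a hang here isolates the IO forwarding path from the collective.

    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char buf[4096];
        ssize_t n;
        long total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            /* No collectives involved: if this loop stalls, the stdin
             * forwarding itself is stuck, not a later MPI_Bcast. */
            while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
                total += n;
                fprintf(stderr, "rank 0: %ld bytes received so far\n", total);
            }
            fprintf(stderr, "rank 0: EOF after %ld bytes\n", total);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }

Run it the same way as the failing case, e.g. "mpirun ./stdin_check < test.in" (the binary name is just an example).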
>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both builds show odd behavior when trying to read from standard input.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, if we start the application lammps across 4 nodes, each node 16 cores, connected by Intel QDR Infiniband, mpirun works fine the 1st time but always gets stuck within a few seconds thereafter.
>>>>>>>>>>>>> Command:
>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ < in.snr
>>>>>>>>>>>>> in.snr is the lammps input file; the compiler is gcc/6.1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Instead, if we use
>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ -in in.snr
>>>>>>>>>>>>> it works 100% of the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some odd behaviors we have gathered so far:
>>>>>>>>>>>>> 1. For a 1-node job, stdin always works.
>>>>>>>>>>>>> 2. For multiple nodes, stdin works unreliably when the number of cores per node is relatively small. For example, for 2/3/4 nodes with 8 cores each, mpirun works most of the time. But with more than 8 cores per node, mpirun works the 1st time, then always gets stuck. There seems to be a magic number where it stops working.
>>>>>>>>>>>>> 3. We tested Quantum Espresso with compiler intel/13 and had the same issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We used gdb to debug and found that when mpirun was stuck, the rest of the processes were all waiting on an MPI broadcast from the master thread. The lammps binary, input file and gdb core files (example.tar.bz2) can be downloaded from this link: https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>>>>>>>>>>>>>
>>>>>>>>>>>>> Extra information:
>>>>>>>>>>>>> 1. The job scheduler is slurm.
>>>>>>>>>>>>> 2. configure setup:
>>>>>>>>>>>>> ./configure --prefix=$PREFIX \
>>>>>>>>>>>>>   --with-hwloc=internal \
>>>>>>>>>>>>>   --enable-mpirun-prefix-by-default \
>>>>>>>>>>>>>   --with-slurm \
>>>>>>>>>>>>>   --with-verbs \
>>>>>>>>>>>>>   --with-psm \
>>>>>>>>>>>>>   --disable-openib-connectx-xrc \
>>>>>>>>>>>>>   --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>>>>>>>>>>>>   --with-cma
>>>>>>>>>>>>> 3. openmpi-mca-params.conf file:
>>>>>>>>>>>>> orte_hetero_nodes=1
>>>>>>>>>>>>> hwloc_base_binding_policy=core
>>>>>>>>>>>>> rmaps_base_mapping_policy=core
>>>>>>>>>>>>> opal_cuda_support=0
>>>>>>>>>>>>> btl_openib_use_eager_rdma=0
>>>>>>>>>>>>> btl_openib_max_eager_rdma=0
>>>>>>>>>>>>> btl_openib_flags=1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jingchao
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>> 402-472-6400
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users