Hmmm...well, the problem appears to be that we aren’t setting up the input 
channel to read stdin. This setup happens immediately after the application is 
launched - there is no “if” clause or anything else in front of it. The only 
way it wouldn’t get called is if the procs weren’t launched at all, but they 
clearly are being launched, yes?

Hence my confusion - there is no test in front of that print statement now, and 
yet we aren’t seeing the code being called.

Could you please add “-mca plm_base_verbose 5” to your cmd line? We should see 
a debug statement print that contains “plm:base:launch wiring up iof for job”.
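
For example, with the same invocation as in your earlier tests:

  $ mpirun -mca plm_base_verbose 5 ./a.out < test.in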



> On Aug 30, 2016, at 11:40 AM, Jingchao Zhang <zh...@unl.edu> wrote:
> 
> I checked again and as far as I can tell, everything was set up correctly. I 
> added "HCC debug" to the output message to make sure it's the correct plugin. 
> 
> The updated output:
> $ mpirun ./a.out < test.in
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 35 for process [[26513,1],0]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 41 for process [[26513,1],0]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 43 for process [[26513,1],0]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 37 for process [[26513,1],1]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 46 for process [[26513,1],1]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 49 for process [[26513,1],1]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 38 for process [[26513,1],2]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 50 for process [[26513,1],2]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 52 for process [[26513,1],2]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 42 for process [[26513,1],3]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 53 for process [[26513,1],3]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 55 for process [[26513,1],3]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 45 for process [[26513,1],4]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 56 for process [[26513,1],4]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 58 for process [[26513,1],4]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 47 for process [[26513,1],5]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 59 for process [[26513,1],5]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 61 for process [[26513,1],5]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 51 for process [[26513,1],6]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 62 for process [[26513,1],6]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 64 for process [[26513,1],6]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 57 for process [[26513,1],7]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 66 for process [[26513,1],7]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 68 for process [[26513,1],7]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 63 for process [[26513,1],8]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 70 for process [[26513,1],8]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 72 for process [[26513,1],8]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 67 for process [[26513,1],9]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 74 for process [[26513,1],9]
> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 76 for process [[26513,1],9]
> Rank 1 has cleared MPI_Init
> Rank 3 has cleared MPI_Init
> Rank 4 has cleared MPI_Init
> Rank 5 has cleared MPI_Init
> Rank 6 has cleared MPI_Init
> Rank 7 has cleared MPI_Init
> Rank 0 has cleared MPI_Init
> Rank 2 has cleared MPI_Init
> Rank 8 has cleared MPI_Init
> Rank 9 has cleared MPI_Init
> Rank 10 has cleared MPI_Init
> Rank 11 has cleared MPI_Init
> Rank 12 has cleared MPI_Init
> Rank 13 has cleared MPI_Init
> Rank 16 has cleared MPI_Init
> Rank 17 has cleared MPI_Init
> Rank 18 has cleared MPI_Init
> Rank 14 has cleared MPI_Init
> Rank 15 has cleared MPI_Init
> Rank 19 has cleared MPI_Init
> 
> 
> The part of the code I changed in file ./orte/mca/iof/hnp/iof_hnp.c:
> 
>     opal_output(0,
>                          "HCC debug: %s iof:hnp pushing fd %d for process %s",
>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                          fd, ORTE_NAME_PRINT(dst_name));
> 
>     /* don't do this if the dst vpid is invalid or the fd is negative! */
>     if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>         return ORTE_SUCCESS;
>     }
> 
> /*    OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>                          "%s iof:hnp pushing fd %d for process %s",
>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                          fd, ORTE_NAME_PRINT(dst_name)));
> */
> 
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Monday, August 29, 2016 11:42:00 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>  
> I’m sorry, but something is simply very wrong here. Are you sure you are 
> pointed at the correct LD_LIBRARY_PATH? Perhaps add a “BOO” or something at 
> the front of the output message to ensure we are using the correct plugin?
> 
> This looks to me like you must be picking up a stale library somewhere.
> 
>> On Aug 29, 2016, at 10:29 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>> 
>> Hi Ralph,
>> 
>> I used the tarball from Aug 26 and added the patch. Tested with 2 nodes, 10 
>> cores/node. Please see the results below:
>> 
>> $ mpirun ./a.out < test.in
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 35 for process [[43954,1],0]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 41 for process [[43954,1],0]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 43 for process [[43954,1],0]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 37 for process [[43954,1],1]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 46 for process [[43954,1],1]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 49 for process [[43954,1],1]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 38 for process [[43954,1],2]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 50 for process [[43954,1],2]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 52 for process [[43954,1],2]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 42 for process [[43954,1],3]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 53 for process [[43954,1],3]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 55 for process [[43954,1],3]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 45 for process [[43954,1],4]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 56 for process [[43954,1],4]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 58 for process [[43954,1],4]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 47 for process [[43954,1],5]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 59 for process [[43954,1],5]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 61 for process [[43954,1],5]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 57 for process [[43954,1],6]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 64 for process [[43954,1],6]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 66 for process [[43954,1],6]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 62 for process [[43954,1],7]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 68 for process [[43954,1],7]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 70 for process [[43954,1],7]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 65 for process [[43954,1],8]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 72 for process [[43954,1],8]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 74 for process [[43954,1],8]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 75 for process [[43954,1],9]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 79 for process [[43954,1],9]
>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 81 for process [[43954,1],9]
>> Rank 5 has cleared MPI_Init
>> Rank 9 has cleared MPI_Init
>> Rank 1 has cleared MPI_Init
>> Rank 2 has cleared MPI_Init
>> Rank 3 has cleared MPI_Init
>> Rank 4 has cleared MPI_Init
>> Rank 8 has cleared MPI_Init
>> Rank 0 has cleared MPI_Init
>> Rank 6 has cleared MPI_Init
>> Rank 7 has cleared MPI_Init
>> Rank 14 has cleared MPI_Init
>> Rank 15 has cleared MPI_Init
>> Rank 16 has cleared MPI_Init
>> Rank 18 has cleared MPI_Init
>> Rank 10 has cleared MPI_Init
>> Rank 11 has cleared MPI_Init
>> Rank 12 has cleared MPI_Init
>> Rank 13 has cleared MPI_Init
>> Rank 17 has cleared MPI_Init
>> Rank 19 has cleared MPI_Init
>> 
>> Thanks,
>> 
>> Dr. Jingchao Zhang
>> Holland Computing Center
>> University of Nebraska-Lincoln
>> 402-472-6400
>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>> Sent: Saturday, August 27, 2016 12:31:53 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>  
>> I am finding this impossible to replicate, so something odd must be going 
>> on. Can you please (a) pull down the latest v2.0.1 nightly tarball, and (b) 
>> add this patch to it?
>> 
>> diff --git a/orte/mca/iof/hnp/iof_hnp.c b/orte/mca/iof/hnp/iof_hnp.c
>> old mode 100644
>> new mode 100755
>> index 512fcdb..362ff46
>> --- a/orte/mca/iof/hnp/iof_hnp.c
>> +++ b/orte/mca/iof/hnp/iof_hnp.c
>> @@ -143,16 +143,17 @@ static int hnp_push(const orte_process_name_t* dst_name, orte_iof_tag_t src_tag,
>>      int np, numdigs;
>>      orte_ns_cmp_bitmask_t mask;
>>  
>> +    opal_output(0,
>> +                         "%s iof:hnp pushing fd %d for process %s",
>> +                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>> +                         fd, ORTE_NAME_PRINT(dst_name));
>> +
>>      /* don't do this if the dst vpid is invalid or the fd is negative! */
>>      if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>>          return ORTE_SUCCESS;
>>      }
>>  
>> -    OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>> -                         "%s iof:hnp pushing fd %d for process %s",
>> -                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>> -                         fd, ORTE_NAME_PRINT(dst_name)));
>> -
>>      if (!(src_tag & ORTE_IOF_STDIN)) {
>>          /* set the file descriptor to non-blocking - do this before we setup
>>           * and activate the read event in case it fires right away
>> 
>> 
>> You can then run the test again without the “--mca iof_base_verbose 100” 
>> flag to reduce the chatter - this print statement will tell me what I need 
>> to know.
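>> 
>> For example, with the same invocation as before:
>> 
>>   $ mpirun ./a.out < test.in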
>> 
>> Thanks!
>> Ralph
>> 
>> 
>>> On Aug 25, 2016, at 8:19 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> 
>>> The IOF fix PR for v2.0.1 was literally just merged a few minutes ago; it 
>>> wasn't in last night's tarball.
>>> 
>>> 
>>> 
>>>> On Aug 25, 2016, at 10:59 AM, r...@open-mpi.org wrote:
>>>> 
>>>> ??? Weird - can you send me an updated output of that last test we ran?
>>>> 
>>>>> On Aug 25, 2016, at 7:51 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>> 
>>>>> Hi Ralph,
>>>>> 
>>>>> I saw the pull request and did a test with v2.0.1rc1, but the problem 
>>>>> persists. Any ideas?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Dr. Jingchao Zhang
>>>>> Holland Computing Center
>>>>> University of Nebraska-Lincoln
>>>>> 402-472-6400
>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>> Sent: Wednesday, August 24, 2016 1:27:28 PM
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>> 
>>>>> Bingo - found it, fix submitted and hope to get it into 2.0.1
>>>>> 
>>>>> Thanks for the assist!
>>>>> Ralph
>>>>> 
>>>>> 
>>>>>> On Aug 24, 2016, at 12:15 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>> 
>>>>>> I configured v2.0.1rc1 with --enable-debug and ran the test with --mca 
>>>>>> iof_base_verbose 100. I also added -display-devel-map in case it 
>>>>>> provides some useful information.
>>>>>> 
>>>>>> The test job has 2 nodes, each with 10 cores. Rank 0 and the mpirun 
>>>>>> command are on the same node.
>>>>>> $ mpirun -display-devel-map --mca iof_base_verbose 100 ./a.out < test.in &> debug_info.txt
>>>>>> 
>>>>>> The debug_info.txt is attached. 
>>>>>> 
>>>>>> Dr. Jingchao Zhang
>>>>>> Holland Computing Center
>>>>>> University of Nebraska-Lincoln
>>>>>> 402-472-6400
>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>> Sent: Wednesday, August 24, 2016 12:14:26 PM
>>>>>> To: Open MPI Users
>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>> 
>>>>>> Afraid I can’t replicate a problem at all, whether rank=0 is local or 
>>>>>> not. I’m also using bash, but on CentOS-7, so I suspect the OS is the 
>>>>>> difference.
>>>>>> 
>>>>>> Can you configure OMPI with --enable-debug, and then run the test again 
>>>>>> with --mca iof_base_verbose 100? It will hopefully tell us something 
>>>>>> about why the IO subsystem is stuck.
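>>>>>> 
>>>>>> For example (a sketch only - your other configure options elided):
>>>>>> 
>>>>>>   $ ./configure --prefix=$PREFIX --enable-debug ...
>>>>>>   $ mpirun --mca iof_base_verbose 100 ./a.out < test.in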
>>>>>> 
>>>>>> 
>>>>>>> On Aug 24, 2016, at 8:46 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>> 
>>>>>>> Hi Ralph,
>>>>>>> 
>>>>>>> For our tests, rank 0 is always on the same node as mpirun. I just 
>>>>>>> tested mpirun with -nolocal and it still hangs. 
>>>>>>> 
>>>>>>> Information on shell and OS
>>>>>>> $ echo $0
>>>>>>> -bash
>>>>>>> 
>>>>>>> $ lsb_release -a
>>>>>>> LSB Version:    
>>>>>>> :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>>>>>>> Distributor ID: Scientific
>>>>>>> Description:    Scientific Linux release 6.8 (Carbon)
>>>>>>> Release:        6.8
>>>>>>> Codename:       Carbon
>>>>>>> 
>>>>>>> $ uname -a
>>>>>>> Linux login.crane.hcc.unl.edu 2.6.32-642.3.1.el6.x86_64 #1 SMP Tue Jul 12 11:25:51 CDT 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>> 
>>>>>>> 
>>>>>>> Dr. Jingchao Zhang
>>>>>>> Holland Computing Center
>>>>>>> University of Nebraska-Lincoln
>>>>>>> 402-472-6400
>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>> Sent: Tuesday, August 23, 2016 8:14:48 PM
>>>>>>> To: Open MPI Users
>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>> 
>>>>>>> Hmmm...that’s a good point. Rank 0 and mpirun are always on the same 
>>>>>>> node on my cluster. I’ll give it a try.
>>>>>>> 
>>>>>>> Jingchao: is rank 0 on the node with mpirun, or on a remote node?
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 23, 2016, at 5:58 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>>>> 
>>>>>>>> Ralph,
>>>>>>>> 
>>>>>>>> Did you run task 0 and mpirun on different nodes?
>>>>>>>> 
>>>>>>>> I observed some random hangs, though I cannot blame openmpi 100% yet.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Gilles
>>>>>>>> 
>>>>>>>> On 8/24/2016 9:41 AM, r...@open-mpi.org wrote:
>>>>>>>>> Very strange. I cannot reproduce it, as I’m able to run any number of 
>>>>>>>>> nodes and procs, pushing over 100MBytes through without any problem.
>>>>>>>>> 
>>>>>>>>> Which leads me to suspect that the issue here is with the tty 
>>>>>>>>> interface. Can you tell me what shell and OS you are running?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Aug 23, 2016, at 3:25 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>> 
>>>>>>>>>> Everything is stuck at MPI_Init. For a test job with 2 nodes and 10 
>>>>>>>>>> cores per node, I got the following
>>>>>>>>>> 
>>>>>>>>>> $ mpirun ./a.out < test.in
>>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>>> Rank 4 has cleared MPI_Init
>>>>>>>>>> Rank 7 has cleared MPI_Init
>>>>>>>>>> Rank 8 has cleared MPI_Init
>>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>>> Rank 5 has cleared MPI_Init
>>>>>>>>>> Rank 6 has cleared MPI_Init
>>>>>>>>>> Rank 9 has cleared MPI_Init
>>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>>> Rank 16 has cleared MPI_Init
>>>>>>>>>> Rank 19 has cleared MPI_Init
>>>>>>>>>> Rank 10 has cleared MPI_Init
>>>>>>>>>> Rank 11 has cleared MPI_Init
>>>>>>>>>> Rank 12 has cleared MPI_Init
>>>>>>>>>> Rank 13 has cleared MPI_Init
>>>>>>>>>> Rank 14 has cleared MPI_Init
>>>>>>>>>> Rank 15 has cleared MPI_Init
>>>>>>>>>> Rank 17 has cleared MPI_Init
>>>>>>>>>> Rank 18 has cleared MPI_Init
>>>>>>>>>> Rank 3 has cleared MPI_Init
>>>>>>>>>> 
>>>>>>>>>> then it just hung.
>>>>>>>>>> 
>>>>>>>>>> --Jingchao
>>>>>>>>>> 
>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>> Holland Computing Center
>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>> 402-472-6400
>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>> Sent: Tuesday, August 23, 2016 4:03:07 PM
>>>>>>>>>> To: Open MPI Users
>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>> 
>>>>>>>>>> The IO forwarding messages all flow over the Ethernet, so the type 
>>>>>>>>>> of fabric is irrelevant. The number of procs involved would 
>>>>>>>>>> definitely have an impact, but that might not be due to the IO 
>>>>>>>>>> forwarding subsystem. We know we have flow control issues with 
>>>>>>>>>> collectives like Bcast that don’t have built-in synchronization 
>>>>>>>>>> points. How many reads were you able to do before it hung?
>>>>>>>>>> 
>>>>>>>>>> I was running it on my little test setup (2 nodes, using only a few 
>>>>>>>>>> procs), but I’ll try scaling up and see what happens. I’ll also try 
>>>>>>>>>> introducing some forced “syncs” on the Bcast and see if that solves 
>>>>>>>>>> the issue.
>>>>>>>>>> 
>>>>>>>>>> Ralph
>>>>>>>>>> 
>>>>>>>>>>> On Aug 23, 2016, at 2:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>> 
>>>>>>>>>>> I tested v2.0.1rc1 with your code but it has the same issue. I also 
>>>>>>>>>>> installed v2.0.1rc1 on a different cluster, which has Mellanox QDR 
>>>>>>>>>>> Infiniband, and got the same result. For the tests you have done, 
>>>>>>>>>>> how many cores and nodes did you use? I can trigger the problem by 
>>>>>>>>>>> using multiple nodes with more than 10 cores per node. 
>>>>>>>>>>> 
>>>>>>>>>>> Thank you for looking into this.
>>>>>>>>>>> 
>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>> 402-472-6400
>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>> Sent: Monday, August 22, 2016 10:23:42 PM
>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>> 
>>>>>>>>>>> FWIW: I just tested forwarding up to 100MBytes via stdin using the 
>>>>>>>>>>> simple test shown below with OMPI v2.0.1rc1, and it worked fine. So 
>>>>>>>>>>> I’d suggest upgrading when the official release comes out, or going 
>>>>>>>>>>> ahead and at least testing 2.0.1rc1 on your machine. Or you can 
>>>>>>>>>>> test this program with some input file and let me know if it works 
>>>>>>>>>>> for you.
>>>>>>>>>>> 
>>>>>>>>>>> Ralph
>>>>>>>>>>> 
>>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>> #include <string.h>
>>>>>>>>>>> #include <stdbool.h>
>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>> 
>>>>>>>>>>> #define ORTE_IOF_BASE_MSG_MAX   2048
>>>>>>>>>>> 
>>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>>> {
>>>>>>>>>>>    int i, rank, size, next, prev, tag = 201;
>>>>>>>>>>>    int pos, msgsize, nbytes;
>>>>>>>>>>>    bool done;
>>>>>>>>>>>    char *msg;
>>>>>>>>>>> 
>>>>>>>>>>>    MPI_Init(&argc, &argv);
>>>>>>>>>>>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>>    MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>>>> 
>>>>>>>>>>>    fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);
>>>>>>>>>>> 
>>>>>>>>>>>    next = (rank + 1) % size;
>>>>>>>>>>>    prev = (rank + size - 1) % size;
>>>>>>>>>>>    msg = malloc(ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>    pos = 0;
>>>>>>>>>>>    nbytes = 0;
>>>>>>>>>>> 
>>>>>>>>>>>    if (0 == rank) {
>>>>>>>>>>>        while (0 != (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
>>>>>>>>>>>            fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
>>>>>>>>>>>            if (msgsize > 0) {
>>>>>>>>>>>                MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>            }
>>>>>>>>>>>            ++pos;
>>>>>>>>>>>            nbytes += msgsize;
>>>>>>>>>>>        }
>>>>>>>>>>>        fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
>>>>>>>>>>>        memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>        MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>        MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>    } else {
>>>>>>>>>>>        while (1) {
>>>>>>>>>>>            MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>            fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
>>>>>>>>>>>            ++pos;
>>>>>>>>>>>            done = true;
>>>>>>>>>>>            for (i=0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
>>>>>>>>>>>                if (0 != msg[i]) {
>>>>>>>>>>>                    done = false;
>>>>>>>>>>>                    break;
>>>>>>>>>>>                }
>>>>>>>>>>>            }
>>>>>>>>>>>            if (done) {
>>>>>>>>>>>                break;
>>>>>>>>>>>            }
>>>>>>>>>>>        }
>>>>>>>>>>>        fprintf(stderr, "Rank %d: recv done\n", rank);
>>>>>>>>>>>        MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>    }
>>>>>>>>>>> 
>>>>>>>>>>>    fprintf(stderr, "Rank %d has completed bcast\n", rank);
>>>>>>>>>>>    MPI_Finalize();
>>>>>>>>>>>    return 0;
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Aug 22, 2016, at 3:40 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This might be a thin argument, but we have had many users running 
>>>>>>>>>>>> mpirun this way for years with no problem until this recent 
>>>>>>>>>>>> upgrade. And some home-brewed MPI codes do not even have a 
>>>>>>>>>>>> standard way to read input files. Last time I checked, the 
>>>>>>>>>>>> openmpi manual still claims it supports stdin 
>>>>>>>>>>>> (https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). 
>>>>>>>>>>>> Maybe I missed it, but the v2.0 release notes did not mention any 
>>>>>>>>>>>> changes to the behavior of stdin either.
>>>>>>>>>>>> 
>>>>>>>>>>>> We can tell our users to run mpirun in the suggested way, but I do 
>>>>>>>>>>>> hope someone can look into the issue and fix it.
>>>>>>>>>>>> 
>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>> Sent: Monday, August 22, 2016 3:04:50 PM
>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>> 
>>>>>>>>>>>> Well, I can try to find time to take a look. However, I will 
>>>>>>>>>>>> reiterate what Jeff H said - it is very unwise to rely on IO 
>>>>>>>>>>>> forwarding. Much better to just directly read the file unless that 
>>>>>>>>>>>> file is simply unavailable on the node where rank=0 is running.
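>>>>>>>>>>>> 
>>>>>>>>>>>> (A minimal sketch of that approach - hypothetical file name, and it 
>>>>>>>>>>>> assumes the file is visible from rank 0's node:)
>>>>>>>>>>>> 
>>>>>>>>>>>>     FILE *infile = fopen("in.snr", "r");  /* open the input file directly */
>>>>>>>>>>>>     /* ...parse it on rank 0 and MPI_Bcast the contents to the other ranks... */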
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here you can find the source code for lammps input 
>>>>>>>>>>>>> https://github.com/lammps/lammps/blob/r13864/src/input.cpp
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Based on the gdb output, rank 0 is stuck at line 167
>>>>>>>>>>>>> 
>>>>>>>>>>>>>     if (fgets(&line[m],maxline-m,infile) == NULL)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> and the rest of the threads are stuck at line 203
>>>>>>>>>>>>> 
>>>>>>>>>>>>>     MPI_Bcast(&n,1,MPI_INT,0,world);
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So rank 0 possibly hangs in the fgets() function.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here is the full backtrace information:
>>>>>>>>>>>>> $ cat master.backtrace worker.backtrace
>>>>>>>>>>>>> #0  0x0000003c37cdb68d in read () from /lib64/libc.so.6
>>>>>>>>>>>>> #1  0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
>>>>>>>>>>>>> #2  0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
>>>>>>>>>>>>> #3  0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
>>>>>>>>>>>>> #4  0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
>>>>>>>>>>>>> #5  0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
>>>>>>>>>>>>> #6  0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>> #0  0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>> #1  0x00002b1635d1fa71 in opal_libevent2022_event_base_loop () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>> #2  0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>> #3  0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>> #4  0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>> #5  0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>> #6  0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
>>>>>>>>>>>>> #7  0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>> #8  0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
>>>>>>>>>>>>> #9  0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>> Sent: Monday, August 22, 2016 2:17:10 PM
>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hmmm...perhaps we can break this out a bit? The stdin will be 
>>>>>>>>>>>>> going to your rank=0 proc. It sounds like you have some 
>>>>>>>>>>>>> subsequent step that calls MPI_Bcast?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Can you first verify that the input is being correctly delivered 
>>>>>>>>>>>>> to rank=0? This will help us isolate whether the problem is in the 
>>>>>>>>>>>>> IO forwarding or in the subsequent Bcast.
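>>>>>>>>>>>>> 
>>>>>>>>>>>>> (For example - a minimal, hypothetical check dropped in right 
>>>>>>>>>>>>> after MPI_Init:)
>>>>>>>>>>>>> 
>>>>>>>>>>>>>     if (0 == rank) {
>>>>>>>>>>>>>         char buf[2048];
>>>>>>>>>>>>>         ssize_t n = read(0, buf, sizeof(buf));  /* fd 0 is stdin */
>>>>>>>>>>>>>         fprintf(stderr, "rank 0 read %zd bytes from stdin\n", n);
>>>>>>>>>>>>>     }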
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both 
>>>>>>>>>>>>>> builds show odd behavior when trying to read from standard input.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For example, if we start the application lammps across 4 nodes, 
>>>>>>>>>>>>>> each node 16 cores, connected by Intel QDR Infiniband, mpirun 
>>>>>>>>>>>>>> works fine the 1st time, but always gets stuck within a few 
>>>>>>>>>>>>>> seconds thereafter.
>>>>>>>>>>>>>> Command:
>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ < in.snr
>>>>>>>>>>>>>> in.snr is the Lammps input file. The compiler is gcc/6.1.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Instead, if we use
>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ -in in.snr
>>>>>>>>>>>>>> it works 100%.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Some odd behaviors we have gathered so far: 
>>>>>>>>>>>>>> 1. For 1-node jobs, stdin always works.
>>>>>>>>>>>>>> 2. For multiple nodes, stdin works unreliably when the number of 
>>>>>>>>>>>>>> cores per node is relatively small. For example, for 2/3/4 
>>>>>>>>>>>>>> nodes, each node 8 cores, mpirun works most of the time. But for 
>>>>>>>>>>>>>> each node with >8 cores, mpirun works the 1st time, then always 
>>>>>>>>>>>>>> gets stuck. There seems to be a magic number at which it stops 
>>>>>>>>>>>>>> working.
>>>>>>>>>>>>>> 3. We tested Quantum Espresso with compiler intel/13 and had the 
>>>>>>>>>>>>>> same issue. 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We used gdb to debug and found that when mpirun was stuck, the 
>>>>>>>>>>>>>> rest of the processes were all waiting on an MPI broadcast from 
>>>>>>>>>>>>>> the master thread. The lammps binary, input file and gdb core 
>>>>>>>>>>>>>> files (example.tar.bz2) can be downloaded from this link: 
>>>>>>>>>>>>>> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Extra information:
>>>>>>>>>>>>>> 1. Job scheduler is slurm.
>>>>>>>>>>>>>> 2. configure setup:
>>>>>>>>>>>>>> ./configure     --prefix=$PREFIX \
>>>>>>>>>>>>>>                --with-hwloc=internal \
>>>>>>>>>>>>>>                --enable-mpirun-prefix-by-default \
>>>>>>>>>>>>>>                --with-slurm \
>>>>>>>>>>>>>>                --with-verbs \
>>>>>>>>>>>>>>                --with-psm \
>>>>>>>>>>>>>>                --disable-openib-connectx-xrc \
>>>>>>>>>>>>>>                --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>>>>>>>>>>>>>                --with-cma
>>>>>>>>>>>>>> 3. openmpi-mca-params.conf file:
>>>>>>>>>>>>>> orte_hetero_nodes=1
>>>>>>>>>>>>>> hwloc_base_binding_policy=core
>>>>>>>>>>>>>> rmaps_base_mapping_policy=core
>>>>>>>>>>>>>> opal_cuda_support=0
>>>>>>>>>>>>>> btl_openib_use_eager_rdma=0
>>>>>>>>>>>>>> btl_openib_max_eager_rdma=0
>>>>>>>>>>>>>> btl_openib_flags=1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Jingchao 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> <debug_info.txt>
>>>>> 
>>>> 
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>> 
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
