Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [isolated]
[atl1-01-mic0:190189] mca:base:select:( plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:190189] mca:base:select:( plm) Querying component [slurm]
[atl1-01-mic0:190189] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:190189] mca:base:select:( plm) Selected component [rsh]
[atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189 nodename hash 4121194178
[atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
[atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
[atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged allocation
[atl1-01-mic0:190189] [[32137,0],0] using dash_host
[atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
[atl1-01-mic0:190189] [[32137,0],0] ignoring myself
[atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in allocation
[atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof for job [32137,1]
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1] registered
[atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is not a dynamic spawn
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Local host: atl1-01-mic0
  PID:        190192
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
shmemrun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[32137,1],0]
  Exit code:    255
--------------------------------------------------------------------------
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190189] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm

On 04/11/2015 07:41 PM, Ralph Castain wrote:
Got it - thanks. I fixed that ERROR_LOG issue (I think - please verify). I suspect the memheap issue relates to something else, but I probably need to let the OSHMEM folks comment on it.
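
[Note: the source of the mic.out test binary was not posted in this thread. For context, a minimal OpenSHMEM program along the following lines would exercise the same initialization path that is failing during memheap selection above; the file name, build command, and contents are only an assumed sketch, not the actual reproducer.]

/* hello_shmem.c -- hypothetical stand-in for the mic.out test program.
 * The start_pes() call is where the mca_memheap_base_select() failure
 * above surfaces; newer OpenSHMEM revisions spell this call shmem_init().
 *
 * Possible build/run, assuming the Open MPI OSHMEM wrapper compiler:
 *   oshcc -o mic.out hello_shmem.c
 *   shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
 */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);                /* initialize OpenSHMEM (1.0-style API) */
    int me   = _my_pe();         /* this PE's rank */
    int npes = _num_pes();       /* total PEs launched by shmemrun (-N 2 above) */
    printf("Hello from PE %d of %d\n", me, npes);
    shmem_barrier_all();         /* make sure every PE got past initialization */
    return 0;
}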