On Mon, Nov 21, 2011 at 1:44 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> On 21.11.2011 at 05:30, mahbube rustaee wrote:
>
> > On Mon, Nov 21, 2011 at 3:27 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> > Hi,
> >
> > On 20.11.2011 at 12:37, mahbube rustaee wrote:
> >
> > > 1) I run intel mpi jobs. when $NSLOTS<=50 , qsub is ok, but for slots >50 either output is empty
> > > or output of job is:
> > >
> > > mpirun has exited due to process rank 4 with PID 23866 on
> > > node amd-7-5.local exiting without calling "finalize". This may
> > > have caused other processes in the application to be
> > > terminated by signals sent by mpirun (as reported here).
> > > --------------------------------------------------------------------------
> > > [amd-7-5.local:23861] 199 more processes have sent help message help-mtl-psm.txt / unable to open endpoint
> > > [amd-7-5.local:23861] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > > [amd-7-5.local:23861] 99 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
> > >
> > > what config is missed?
> >
> > the errors are from Open MPI, but above you state Intel MPI. Hence the $PATH on the exechost might point to the wrong `mpiexec`.
> >
> > You can investigate this by `which mpiexec` in your jobscript.
> >
> > I checked that, path of mpirun is correct. my script is:
> > #!/bin/sh
> > #$ -S /bin/bash
> > #$ -N Det2-200core
> > #$ -cwd
> > #$ -l h_vmem=500M,mem_free=10M
> > #$ -j y
> > #$ -pe mpi16 64
> > . $HOME/.intelbash
> > . /var/mpi-selector/data/openmpi_intel_qlc-1.4.2.sh
> > which mpirun
>
> And what's the output?
>
> > mpirun -n $NSLOTS mpi.intel.comp
> >
> > .intelbash and openmpi_intel_qlc-1.4.2.sh set $PATH and library path .
>
> As you can't setup two MPI libraries at the same time, I would assume that you missed an argument to the script.
>
> Which library you used to compile the application? This one must be used for execution too.

Those libraries don't conflict.
I modified the qsub script: I ran mpi-selector-menu to configure the environment variables and added the -V option to the script. The errors are:

amd-10-9.local
[amd-10-9.local:28409] [[44203,0],0] ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can be open

This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------

But I can run the job via the CLI: I ssh'ed to the master node on which SGE ran the job, copied the .po file to a machinefile, and ran mpirun with that machinefile (the same host file SGE used). That run is successful.

Why can I run an MPI job directly (via the CLI) when SGE cannot?

> -- Reuti
> >
> > -- Reuti
> >
> > > 2) when I run a job directly via CLI, depend on number of slots also program ,output is correct !
> > > I think some config on OS and SGE is missed!
> > >
> > > Thx
> > >
> > > _______________________________________________
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
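
Since the error text itself points at the descriptor limit (`ulimit`), one way to narrow this down might be to print the limits from inside the job script and again from an interactive ssh login on the same node, then compare the two. The following is only a minimal sketch and not part of the original exchange: the helper name `report_limits` is hypothetical, and whether the batch environment really has lower limits than the login shell is an assumption to be checked, not a confirmed cause.

```shell
#!/bin/bash
# Hypothetical diagnostic helper (not from the original thread).
# Prints the resource limits seen by the current shell so that the values
# inside an SGE job can be compared with those of an interactive login.
report_limits() {
    echo "host=$(uname -n)"
    echo "nofile_soft=$(ulimit -Sn)"   # soft limit on open file descriptors
    echo "nofile_hard=$(ulimit -Hn)"   # hard limit on open file descriptors
    echo "memlock_soft=$(ulimit -Sl)"  # locked memory; assumed relevant for the interconnect
}

report_limits

# The Open MPI error text also suggests this MCA parameter; shown here only
# as a commented-out variant of the mpirun line from the job script:
# mpirun --mca opal_set_max_sys_limits 1 -n $NSLOTS mpi.intel.comp
```

Running the same function in the job script (before `mpirun`) and in the ssh session gives two sets of values; if `nofile_soft` differs between them, that would be consistent with the "system limit exceeded" message appearing only under SGE.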