Can you make stata not detach from its parent process? Because it returns immediately, srun thinks the task has finished and kills all remaining unowned processes.
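If xstata cannot be kept in the foreground, one possible workaround is a wrapper script that keeps the srun task alive until the detached GUI process is gone. This is only a sketch under assumptions: the real GUI process name would have to be found (e.g. with something like pgrep), and here a plain "sleep" stands in for it so the sketch is self-contained.

```shell
#!/bin/sh
# Sketch: run "srun wrapper.sh" instead of "srun xstata".
# In real use you would launch xstata here and then poll for the detached
# GUI process (hypothetically: pgrep -u "$USER" -x stata).
# "sleep 2" stands in for the detached GUI so this sketch is runnable.
sleep 2 &
gui_pid=$!

# Keep the job step alive until the detached process disappears.
# "kill -0" sends no signal; it only checks that the pid still exists.
while kill -0 "$gui_pid" 2>/dev/null; do
    sleep 1
done
echo "GUI process gone; the wrapper (and the srun step) can exit now"
```

With a wrapper like this the task srun is watching does not return until the GUI has exited, so slurmstepd has no reason to SIGKILL the leftover processes.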
On Wed, Oct 16, 2013 at 4:32 PM, Yann Sagon <[email protected]> wrote:
> Update: when xstata is started, it immediately returns with an exit code
> of 0 (but leaves the GUI open). I suspect that srun thinks the job
> terminated successfully and kills stata. Any clue on that?
>
>
> 2013/10/16 Yann Sagon <[email protected]>
>
>> On my cluster with Slurm 2.6.2 I'm having a problem running xstata (it's
>> the graphical version of Stata).
>>
>> If I launch xstata directly on the master or on any node as a normal
>> user, everything is fine.
>>
>> If I launch xstata with srun (just srun xstata), nothing happens (no
>> output, nothing special in the slurm log) and the command terminates
>> almost immediately.
>>
>> I am able to launch other graphical applications.
>>
>> I have also tried launching xstata with --slurmd-debug:
>>
>> srun --slurmd-debug=4 xstata
>> slurmd[node01]: debug level = 6
>> slurmd[node01]: Uncached user/gid: sagon/1000
>> slurmd[node01]: IO handler started pid=105416
>> slurmd[node01]: task 0 (105421) started 2013-10-16T15:44:54
>> slurmd[node01]: Setting slurmstepd oom_adj to -1000
>> slurmd[node01]: adding task 0 pid 105421 on node 0 to jobacct
>> slurmd[node01]: 105421 mem size 1008 200024 time 0(0+0)
>> slurmd[node01]: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
>> slurmd[node01]: cpu 0 freq= 2201000
>> slurmd[node01]: Task average frequency = 2201000 pid 105421 mem size 1008 200024 time 0(0+0)
>> slurmd[node01]: energycounted = 0
>> slurmd[node01]: getjoules_task energy = 0
>> slurmd[node01]: Sending launch resp rc=0
>> slurmd[node01]: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>> slurmd[node01]: Handling REQUEST_INFO
>> slurmd[node01]: Handling REQUEST_SIGNAL_CONTAINER
>> slurmd[node01]: _handle_signal_container for step=48997.0 uid=0 signal=995
>> slurmd[node01]: Uncached user/gid: sagon/1000
>> slurmd[node01]: mpi type = (null)
>> slurmd[node01]: Using mpi/openmpi
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 18446744073709551615
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_NPROC no change in value: 18446744073709551615
>> slurmd[node01]: _set_limit: RLIMIT_NOFILE : max:8192 cur:8192 req:1024
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 18446744073709551615
>> slurmd[node01]: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
>> slurmd[node01]: removing task 0 pid 105421 from jobacct
>> slurmd[node01]: task 0 (105421) exited with exit code 0.
>> slurmd[node01]: Aggregated 1 task exit messages
>> slurmd[node01]: killing process 105424 (inherited_task) with signal 9
>> slurmd[node01]: killing process 105424 (inherited_task) with signal 9
>> slurmd[node01]: Sending SIGKILL to pgid 105416
>> slurmd[node01]: Waiting for IO
>> slurmd[node01]: Closing debug channel
>>
>> Thanks for your ideas!
>

--
Carles Fenoy
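What the log above suggests can be reproduced with a stand-alone sketch (the launcher below is hypothetical, not Stata's actual code): a launcher that forks its real work and returns at once exits with code 0, so srun reports success while the detached child is left behind in the step's process group, where slurmstepd then SIGKILLs it.

```shell
#!/bin/sh
# Mimic a launcher that double-forks its real work and returns at once,
# as xstata appears to do. The orphaned child (standing in for the GUI)
# keeps running, but the process srun is watching has already exited.
( sleep 1 & ) >/dev/null 2>&1    # detach a stand-in for the GUI
rc=$?
echo "launcher returned immediately with rc=$rc"
# srun sees rc=0, treats the task as finished, and slurmstepd then sends
# SIGKILL to the remaining pgid -- matching the log lines above.
```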
