Dear Reuti,

Thank you for your answer. I should have specified that I am not an
admin on this cluster so I do not have access to the `qconf` command,
however you were right: `ssh` is definitively used to access nodes
(probably on purpose since we have access to several GUI apps). Your
answer made me check my ~/.ssh/ directory, and I found dozens of
*.socket files in there.

After removing these files, qmake and qrsh perform flawlessly (shepherd
exit code 0). I still do not know what caused this problem and at which
point these files were created, but I will know what to look for would
this problem reappear.

Thank you very much for your help.

Sincerely,
− Nils

On 26/02/2018 17:00, Reuti wrote:
> Hi,
>
> it looks like the connection to nodes is set to `ssh`. Does your output of:
>
> $ qconf -sconf
> #global:
> qlogin_command
> qlogin_daemon
> rlogin_command
> rlogin_daemon
> rsh_command
> rsh_daemon
>
> reflect this? Do you need `ssh` to access nodes by SGE for X11 forwarding?
>
> -- Reuti
>
>
>> Am 26.02.2018 um 15:29 schrieb Nils Giordano <[email protected]>:
>>
>> Dear all,
>>
>> I try to run a simple Makefile with qmake (SGE 8.1.9) but it fails
>> everytime after the first round of commands with the following error:
>> ------------------------------------------
>> $ qmake -cwd -v PATH -pe make 1 -verbose --
>> [...]
>> reading exit code from shepherd ... timeout (60 s) expired while waiting
>> on socket fd 4
>> error: error reading returncode of remote command
>> cleaning up after abnormal exit of /usr/bin/ssh -o LogLevel=ERROR
>> ------------------------------------------
>> ------------------------------------------
>> $ cat Makefile
>> all: a.y b.y c.y d.y e.y f.y
>>
>> %.y: %.x
>>     touch $@; sleep 3
>> ------------------------------------------
>>
>> Overall, only a.y is created. If I use N slots (with -pe make 1-N), only
>> N files are created. It seems to me that qmake gets stuck because it
>> fails to close opened connections. Note that I have the same problem
>> when I do not use the -pe option, or when I try to run qmake -inherit in
>> a qsub script. Apart from that, qsub and qlogin work fine.
>>
>> I think I narrowed the problem to be related to qrsh, as I have a
>> similar error with this command:
>> ------------------------------------------
>> $ qrsh -cwd -v PATH -verbose hostname
>> Your job 121477 ("hostname") has been submitted
>> waiting for interactive job to be scheduled ...
>> Your interactive job 121477 has been successfully scheduled.
>> Establishing /usr/bin/ssh -o LogLevel=ERROR session to host XXX.prive ...
>> XXX.prive
>> /usr/bin/ssh -o LogLevel=ERROR exited with exit code 0
>> reading exit code from shepherd ... timeout (60 s) expired while waiting
>> on socket fd 4
>> error: error reading returncode of remote command
>> cleaning up after abnormal exit of /usr/bin/ssh -o LogLevel=ERROR
>> ------------------------------------------
>>
>> Any idea what might cause this problem? You can find below the complete
>> output.
>>
>> Sincerely,
>> −Nils
>>
>> ------------------------------------------
>> Complete output:
>> $ qstat -help
>> SGE 8.1.9
>> $ qmake -cwd -verbose -v PATH -pe make 1 --
>> dynamic task allocation mode
>> sge_argv[0] = qmake
>> sge_argv[1] = -cwd
>> sge_argv[2] = -verbose
>> sge_argv[3] = -v
>> sge_argv[4] = PATH
>> sge_argv[5] = -pe
>> sge_argv[6] = make
>> sge_argv[7] = 1
>> gmake_argv[0]  = qmake
>> determine qmake startmode
>> setting default options: -l arch=lx-amd64
>> creating scheduled qmake
>> argv[  0] = qrsh
>> argv[  1] = -noshell
>> argv[  2] = -cwd
>> argv[  3] = -verbose
>> argv[  4] = -v
>> argv[  5] = PATH
>> argv[  6] = -pe
>> argv[  7] = make
>> argv[  8] = 1
>> argv[  9] = -l
>> argv[ 10] = arch=lx-amd64
>> argv[ 11] = qmake
>> argv[ 12] = -inherit
>> argv[ 13] = -verbose
>> argv[ 14] = -cwd
>> argv[ 15] = -v
>> argv[ 16] = PATH
>> argv[ 17] = -l
>> argv[ 18] = arch=lx-amd64
>> argv[ 19] = --
>> Your job 121548 ("qmake") has been submitted
>> waiting for interactive job to be scheduled ...
>> Your interactive job 121548 has been successfully scheduled.
>> Establishing /usr/bin/ssh -o LogLevel=ERROR session to host
>> gknzwd2.XXX.prive ...
>> sge_argv[0] = qmake
>> sge_argv[1] = -inherit
>> sge_argv[2] = -verbose
>> sge_argv[3] = -cwd
>> sge_argv[4] = -v
>> sge_argv[5] = PATH
>> sge_argv[6] = -l
>> sge_argv[7] = arch=lx-amd64
>> gmake_argv[0]  = qmake
>> determine qmake startmode
>> inserting -j option from NSLOTS environment: -j 1
>> sge hostfile =
>> /opt/sge/BiRD_v2/spool/gknzwd2/active_jobs/121548.1/pe_hostfile
>> qmake  hostfile = /tmp/121548.1.max-24h.q/qmake_hostfile
>> qmake  lockfile = /tmp/121548.1.max-24h.q/qmake_lockfile
>> creating qmake hostfile
>> number of slots for qmake execution is 1
>> enabling next task to be executed as Grid Engine parallel task
>> touch a.y; sleep 3
>> export the following environment variables:
>> SGE_RSH_COMMAND,BASH_FUNC_module(),MAKEFLAGS,MFLAGS,MAKELEVEL
>> obtained lock to qmake lockfile
>> clearing lock to hostfile
>> next host for qmake job is gknzwd2.XXX.prive
>> gknzwd2.XXX.prive
>> gmake requesting status of dead child processes
>> gmake requesting status of dead child processes
>> waiting for child failed: timeout
>> starting job:
>> args[  0] = qrsh
>> args[  1] = -noshell
>> args[  2] = -verbose
>> args[  3] = -inherit
>> args[  4] = -cwd
>> args[  5] = -v
>> args[  6] = SGE_RSH_COMMAND,BASH_FUNC_module(),MAKEFLAGS,MFLAGS,MAKELEVEL
>> args[  7] = -v
>> args[  8] = PATH
>> args[  9] = gknzwd2.XXX.prive
>> args[ 10] = /bin/sh
>> args[ 11] = -c
>> args[ 12] = touch a.y; sleep 3
>> Starting server daemon at host "gknzwd2.XXX.prive"
>> Server daemon successfully started with task id "1.gknzwd2"
>> Establishing /usr/bin/ssh -o LogLevel=ERROR session to host
>> gknzwd2.XXX.prive ...
>> /usr/bin/ssh -o LogLevel=ERROR exited with exit code 0
>> reading exit code from shepherd ... timeout (60 s) expired while waiting
>> on socket fd 4
>> error: error reading returncode of remote command
>> obtained lock to qmake lockfile
>> unlock_hostentry 0
>> clearing lock to hostfile
>> qmake: *** [a.y] Error 255
>> cleanup of remote mechanism
>> /usr/bin/ssh -o LogLevel=ERROR exited with exit code 0
>> reading exit code from shepherd ... timeout (60 s) expired while waiting
>> on socket fd 4
>> error: error reading returncode of remote command
>> cleaning up after abnormal exit of /usr/bin/ssh -o LogLevel=ERROR
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>

-- 
Nils Giordano, PhD
Postdoc in Computational Biology team (ComBi)
Laboratoire des Sciences du Numérique de Nantes (LS2N), UMR 6004
https://www.normalesup.org/~giordano/



_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

Reply via email to