On 17.06.2011, at 10:18, baf035 wrote:

> Yes, using classic spooling.
> The spool directory is created on an NFSv3 filesystem mounted throughout the HPC environment.
> This configuration has been in use for several years; we never saw this kind of trouble in SGE versions up to 6.2u5 and SoGE rel. 3710.
Okay, looks fine then. I have no further idea.

-- Reuti

> The structure of the files created under the qmaster/jobs directory:
>
> single job (1 slot):
> qstat -u \* -s a | grep 187471
>  187471 8.50000 METODIK    dzcjsjo      r     06/15/2011 10:16:19 all.q@node05np01       1
> root@sged8:/jms/spool/i001/sge_spool/qmaster# ls jobs/00/0018/7471
> jobs/00/0018/7471
> root@sged8:/jms/spool/i001/sge_spool/qmaster# file jobs/00/0018/7471
> jobs/00/0018/7471: data
>
> parallel job waiting or on hold:
> root@sged8:/jms/spool/i001/sge_spool/qmaster# qstat -u \* -s a | grep 191340
>  191340 5.62661 adc_1PQ    dzcar18      qw    06/17/2011 09:35:48      48
> root@sged8:/jms/spool/i001/sge_spool/qmaster# ls -laR jobs/00/0019/1340/
> jobs/00/0019/1340/:
> total 28
> drwxr-xr-x   2 sgeadm tkvyp    19 2011-06-17 09:35 .
> drwxr-xr-x 124 sgeadm tkvyp 16384 2011-06-17 09:58 ..
> -rw-r--r--   1 sgeadm tkvyp  4708 2011-06-17 09:35 common
>
> parallel job running:
>  191256 7.32617 ABCD_2PB   d471676      r     06/17/2011 09:35:47 all.q@node04n120      48
>
> root@sged8:/jms/spool/i001/sge_spool/qmaster# ls -laR jobs/00/0019/1256
> jobs/00/0019/1256:
> total 28
> drwxr-xr-x   3 sgeadm tkvyp    32 2011-06-17 09:35 .
> drwxr-xr-x 136 sgeadm tkvyp 16384 2011-06-17 09:53 ..
> drwxr-xr-x   3 sgeadm tkvyp    14 2011-06-17 09:35 1-4096
> -rw-r--r--   1 sgeadm tkvyp  4820 2011-06-17 09:35 common
>
> jobs/00/0019/1256/1-4096:
> total 0
> drwxr-xr-x 3 sgeadm tkvyp 14 2011-06-17 09:35 .
> drwxr-xr-x 3 sgeadm tkvyp 32 2011-06-17 09:35 ..
> drwxr-xr-x 2 sgeadm tkvyp 66 2011-06-17 09:36 1
>
> jobs/00/0019/1256/1-4096/1:
> total 16
> drwxr-xr-x 2 sgeadm tkvyp 66 2011-06-17 09:36 .
> drwxr-xr-x 3 sgeadm tkvyp 14 2011-06-17 09:35 ..
> -rw-r--r-- 1 sgeadm tkvyp 735 2011-06-17 09:36 1.r2i2n8
> -rw-r--r-- 1 sgeadm tkvyp 736 2011-06-17 09:36 1.r2i3n12
> -rw-r--r-- 1 sgeadm tkvyp 736 2011-06-17 09:36 1.r4i3n13
> -rw-r--r-- 1 sgeadm tkvyp 892 2011-06-17 09:35 common
>
> The data above are from a production instance; they are missing in the testing instance based on SoGE rel. 3910, despite correct job scheduling.
>
> baf035
>
> 2011/6/16 Reuti <[email protected]>
>
> On 16.06.2011, at 15:03, baf035 wrote:
>
> > We are using SoGE rel. 3910 for tests.
> > Submitted jobs are correctly dispatched, but no information is stored in the spool directory <SPOOL_DIR>/qmaster/jobs.
>
> You are using classic spooling?
>
> > In the qmaster messages file there are entries about a missing file/folder at the time the job ends:
> > ----------------
> > 06/16/2011 10:06:30|schedu|sged2|E|can't find parallel task 50993.1 task past_usage for update in function pe_task_update_master_list_usage
> > 06/16/2011 10:06:30|schedu|sged2|E|callback function for event "3941466. EVENT JOB 50993.1 task past_usage USAGE" failed
> > 06/16/2011 10:07:10|worker|sged2|E|unlink(jobs/00/0005/0993/common) failed: No such file or directory
> > 06/16/2011 10:07:10|worker|sged2|E|can not remove file job spool file: jobs/00/0005/0993/common
>
> The "common" is strange here. What I saw in the past was just a plain file like 0993 containing binary information of the job.
>
> > 06/16/2011 10:07:10|worker|sged2|E|can not remove file job spool directory: jobs/00/0005/0993
> > ---------------
> > qacct -j 50993 | grep end_time | uniq
> > end_time     Thu Jun 16 10:05:52 2011
> > --------------
> >
> > A migration of the qmasterd leads to a total loss of job information. No jobs appear in qstat after the migration.
> >
> > We have also encountered a case where the files in <SPOOL_DIR>/qmaster/jobs were created correctly but disappeared during the migration, without any entry in the messages file.
> And it's in a shared space?
>
> -- Reuti
>
> > Please validate this behavior, and thanks for a fix.
> >
> > baf035
> > _______________________________________________
> > SGE-discuss mailing list
> > [email protected]
> > https://arc.liv.ac.uk/mailman/listinfo/sge-discuss

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
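[Editor's note on the spool layout: the classic-spooling paths quoted in this thread all follow one scheme, which can be inferred from the three job IDs shown (187471 → jobs/00/0018/7471, 191340 → jobs/00/0019/1340, 50993 → jobs/00/0005/0993): the job ID is zero-padded to ten digits and split 2/4/4 into directory components. A minimal sketch of that mapping; the helper name `spool_path` is illustrative, not part of SGE itself:]

```python
def spool_path(job_id: int) -> str:
    """Map a job ID to its classic-spooling location under the
    qmaster spool dir, as seen in the listings in this thread:
    zero-pad the ID to 10 digits, then split it 2/4/4."""
    s = f"{job_id:010d}"          # e.g. 187471 -> "0000187471"
    return f"jobs/{s[:2]}/{s[2:6]}/{s[6:]}"

# Reproduces the paths quoted above:
print(spool_path(187471))  # jobs/00/0018/7471
print(spool_path(191340))  # jobs/00/0019/1340
print(spool_path(50993))   # jobs/00/0005/0993
```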
