Hi,
thanks for your answer, Reuti:
The spool directory is a local folder, which exists and can be used. To 
confirm this, I just tested with the "KEEP_ACTIVE" parameter set; 
interestingly, the error did not occur until the fourth of the small jobs, 
which indicates that time (or load, or something similar) is probably 
involved. Comparing the files, those of the finished jobs look very similar; 
for the failed job only 'config', 'environment' and the 'pe_hostfile' were 
created, but they seemed normal.
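In case it's useful, the comparison above was basically this sketch; the expected file names and the spool path are taken from my setup and may well differ elsewhere:

```shell
# Sketch: report which of the shepherd's usual files exist in a job's
# active_jobs directory (file list and spool path from my local setup).
check_job_dir() {
    for f in config environment pe_hostfile pid exit_status; do
        [ -e "$1/$f" ] && echo "present: $f" || echo "missing: $f"
    done
}
# Example call; the spool path here is hypothetical:
check_job_dir "/var/spool/sge/$(hostname)/active_jobs/4758.1"
```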

The binaries are all the same; they are not on an NFS mount, but copies of 
each other. Except the ones on one test node, of course, which I compiled 
myself to see whether that solves my problem.

And there is only one execd running. "Process" was the wrong word, maybe 
"threads"? If I run 'ps axu | grep execd' I find only one daemon, but if I 
'strace -ff -p' it, I get the output of five processes/threads. I would have 
liked to start execd in some single-threaded mode, since parallelism can be 
a source of seemingly random segfaults, but I didn't find an option for that.
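For completeness, this is roughly how I counted the tasks of a process on these hosts without strace; it just reads Linux's /proc. I'm demonstrating on the shell's own PID here; for execd you'd substitute its PID (e.g. from 'pgrep -x sge_execd', assuming pgrep is installed):

```shell
# Count the kernel tasks (threads) of a process via /proc/<pid>/task.
# Demo on this shell's own PID; substitute the execd PID for the real check.
pid=$$
nthreads=$(ls "/proc/$pid/task" | wc -l)
echo "PID $pid has $nthreads task(s)"
```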

I might be wrong there, and I'm willing to test those things, but I don't 
really think my GE configuration causes the problem: all I did was update 
the execution hosts from SUSE Linux Enterprise Server 11.1 to SLES 11.3, and 
I didn't touch the SGE folders at all. Secondly, the only "error" I can read 
out of the straces is a segfault, which normally shouldn't be caused by a 
wrong configuration file either.
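One more observation: the glibc backtrace below runs through libldap/libnss_ldap inside __libc_fork, so I also looked at whether NSS on these hosts resolves users via LDAP. Roughly like this (the config path is the standard glibc location; the grep is just my quick check):

```shell
# Quick check whether passwd/group/hosts lookups are routed through LDAP
# in nsswitch.conf (standard glibc location assumed).
check_nss_ldap() {
    grep -E '^[[:space:]]*(passwd|group|hosts):.*ldap' "$1" 2>/dev/null \
        || echo "no ldap entries in $1"
}
check_nss_ldap /etc/nsswitch.conf
```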
Sven

> Date: Thu, 17 Apr 2014 17:11:47 +0200
> From: Reuti <[email protected]>
> To: Sve N <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: [gridengine users] GE2011.11p1 on SLES 11.3 - execution
>       host in error state
> Message-ID:
>       <[email protected]>
> Content-Type: text/plain; charset=us-ascii
> 
> Hi,
> 
> Am 17.04.2014 um 16:30 schrieb Sve N:
> 
> > we have been using open grid engine for some time now on our linux 
> > machines, which were running suse linux enterprise server 11.1. I recently 
> > updated some of them to SLES 11.3 (some were just patched, and some had a 
> > fresh install) and since then, gridengine has some faulty behavior:
> > 
> > The first job submitted to an execution host runs and finishes correctly, 
> > but if one submits a second one, the host switches into an error state 
> > instantly, leaving the second job as 'qw'. It seems as if there is a very 
> > small time-window (< ~1 s), where a second job can be submitted after the 
> > first one but anything later, independent of whether the first one is still 
> > running or not, results in the error.
> > To be able to run the next job, one has to stop and start 
> > /etc/init.d/sgeexecd.
> > 
> > The messages-file of the spools-directory says:
> > ___________________________________________________________________________________________________
> > 04/17/2014 11:40:15|  main|host-4|I|controlled shutdown 2011.11
> > 04/17/2014 11:40:22|  main|host-4|I|starting up OGS/GE 2011.11 (linux-x64)
> > 04/17/2014 11:41:28|  main|host-4|E|shepherd of job 4758.1 died through 
> > signal = 11
> > 04/17/2014 11:41:28|  main|host-4|E|abnormal termination of shepherd for 
> > job 4758.1: no "exit_status" file
> > 04/17/2014 11:41:28|  main|host-4|E|can't open file 
> > active_jobs/4758.1/error: Datei oder Verzeichnis nicht gefunden [= file or 
> > directory not found]
> 
> What is the location of the spool directory on the exechosts? Does it live in 
> the NFS location where you are sharing the binaries, or do they go to a local 
> place like /var/spool/sge? Maybe this needs to be created.
> 
> 
> > 04/17/2014 11:41:28|  main|host-4|E|can't open pid file 
> > "active_jobs/4758.1/pid" for job 4758.1
> > ___________________________________________________________________________________________________
> > 
> > Where 4758 is the second job. The signal is mostly 11, sometimes 6; I 
> > don't know how to influence this. I used strace on the execd to maybe get 
> > a clue. The output of the newly started process, which is invoked for the 
> > second job, contained this:
> > ___________________________________________________________________________________________________
> > set_robust_list(0x7f0c9366c9e0, 0x18) = 0
> > getsockname(3, {sa_family=AF_INET, sin_port=htons(60960), 
> > sin_addr=inet_addr("1.2.3.4")}, [16]) = 0
> > getpeername(3, {sa_family=AF_INET, sin_port=htons(389), 
> > sin_addr=inet_addr("5.6.7.8")}, [16]) = 0
> > fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> > dup(3)                                  = 7
> > fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > fcntl(8, F_GETFD) = 0
> > dup2(8, 3) = 3
> > fcntl(3, F_SETFD, 0) = 0
> > close(8)                                = 0
> > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > ___________________________________________________________________________________________________
> > 
> > Since there is a segmentation fault, I thought that maybe some libraries 
> > changed in the new SUSE version, so I compiled gridengine on one of the new 
> > machines. Since I probably don't need everything (they are only used as 
> > execution hosts), I used ./aimk -only-core -no-jni -no-java. With some
> 
> All machines should use the same binaries. Do you run different versions on 
> different machines in the cluster?
> 
> 
> > tinkering it finally worked, up to and including the creation of the local 
> > distribution. But the install_execd script complained that qmake, qtcsh, 
> > rlogin, rsh and rshd were missing. So I just copied all the other binaries, 
> > libraries and files to a host with an old gridengine version installed. 
> > Unfortunately this didn't solve the problem.
> > The output of strace of the new process now looks a bit different:
> > ___________________________________________________________________________________________________
> > set_robust_list(0x7ffe451639e0, 0x18) = 0
> > getsockname(3, {sa_family=AF_INET, sin_port=htons(51974), 
> > sin_addr=inet_addr("1.2.3.5")}, [16]) = 0
> > getpeername(3, {sa_family=AF_INET, sin_port=htons(389), 
> > sin_addr=inet_addr("5.6.7.8")}, [16]) = 0
> > fcntl(3, F_GETFD)                  = 0x1 (flags FD_CLOEXEC)
> > dup(3) = 7
> > fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > fcntl(8, F_GETFD)                                = 0
> > dup2(8, 3) = 3
> > fcntl(3, F_SETFD, 0)                    = 0
> > close(8) = 0
> > open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK)                                
> > = -1 ENXIO (No such device or address)
> > writev(2, [{"*** glibc detected *** ", 23}, 
> > {"/opt/sge/bin/linux-x64/sge_execd", 32}, {": ", 2}, {"free(): invalid 
> > pointer", 23}, {": 0x", 4}, {"00007ffe44425188", 16}, {" ***\n", 5}], 7)    
> >                             = 105
> > open("/opt/sge/bin/linux-x64/../../lib/linux-x64/libgcc_s.so.1", O_RDONLY) 
> > = -1 ENOENT (No such file or directory)
> > open("/opt/sge/lib/linux-x64/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such 
> > file or directory)
> > open("/etc/ld.so.cache", O_RDONLY) = 8
> > fstat(8, {st_mode=S_IFREG|0644, st_size=50062, ...}) = 0
> > mmap(NULL, 50062, PROT_READ, MAP_PRIVATE, 8, 0) = 0x7ffe45136000
> > close(8) = 0
> > open("/lib64/libgcc_s.so.1", O_RDONLY)          = 8
> > read(8, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200.\0\0\0\0\0\0"..., 832) = 832
> > fstat(8, {st_mode=S_IFREG|0755, st_size=88552, ...}) = 0
> > mmap(NULL, 2184216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 8, 0) = 0x7ffe3fe06000
> > fadvise64(8, 0, 2184216, POSIX_FADV_WILLNEED) = 0
> > mprotect(0x7ffe3fe1b000, 2093056, PROT_NONE) = 0
> > mmap(0x7ffe4001a000, 8192, PROT_READ|PROT_WRITE, 
> > MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 8, 0x14000) = 0x7ffe4001a000
> > close(8)                                = 0
> > mprotect(0x7ffe4001a000, 4096, PROT_READ) = 0
> > munmap(0x7ffe45136000, 50062)           = 0
> > futex(0x7ffe448ba610, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> > futex(0x7ffe4001b1a4, FUTEX_WAKE_PRIVATE, 2147483647)          = 0
> > write(2, "======= Backtrace: =========\n", 29)                              
> >   = 29
> > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"+0x", 3}, {"76618", 5}, 
> > {")", 1}, {"[0x", 3}, {"7ffe445ba618", 12}, {"]\n", 2}], 8)        = 43
> > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, 
> > {"ldap_free_urldesc", 17}, {"+0x", 3}, {"19", 2}, {")", 1}, {"[0x", 3}, 
> > {"7ffe435ae449", 12}, {"]\n", 2}], 9)                              = 68
> > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, 
> > {"ldap_free_urllist", 17}, {"+0x", 3}, {"18", 2}, {")", 1}, {"[0x", 3}, 
> > {"7ffe435ae4c8", 12}, {"]\n", 2}], 9) = 68
> > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, 
> > {"ldap_free_connection", 20}, {"+0x", 3}, {"132", 3}, {")", 1}, {"[0x", 3}, 
> > {"7ffe435aaed2", 12}, {"]\n", 2}], 9)                               = 72
> > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, {"ldap_ld_free", 
> > 12}, {"+0x", 3}, {"b7", 2}, {")", 1}, {"[0x", 3}, {"7ffe435a1d77", 12}, 
> > {"]\n", 2}], 9) = 63
> > writev(2, [{"/lib64/libnss_ldap.so.2", 23}, {"(", 1}, {"+0x", 3}, {"4047", 
> > 4}, {")", 1}, {"[0x", 3}, {"7ffe437d5047", 12}, {"]\n", 2}], 8)             
> >                    = 49
> > writev(2, [{"/lib64/libnss_ldap.so.2", 23}, {"(", 1}, {"+0x", 3}, {"7ad5", 
> > 4}, {")", 1}, {"[0x", 3}, {"7ffe437d8ad5", 12}, {"]\n", 2}], 8)            
> > = 49
> > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_fork", 11}, {"+0x", 
> > 3}, {"1df", 3}, {")", 1}, {"[0x", 3}, {"7ffe445ec87f", 12}, {"]\n", 2}], 9) 
> >                  = 52
> > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > {"sge_exec_job", 12}, {"+0x", 3}, {"5a05", 4}, {")", 1}, {"[0x", 3}, 
> > {"4332f5", 6}, {"]\n", 2}], 9) = 64
> > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"[0x", 3}, {"435293", 
> > 6}, {"]\n", 2}], 4) = 43
> > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > {"do_ck_to_do", 11}, {"+0x", 3}, {"286", 3}, {")", 1}, {"[0x", 3}, 
> > {"435906", 6}, {"]\n", 2}], 9) = 62
> > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > {"sge_execd_process_messages", 26}, {"+0x", 3}, {"43c", 3}, {")", 1}, 
> > {"[0x", 3}, {"42cd7c", 6}, {"]\n", 2}], 9) = 77
> > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, {"main", 4}, 
> > {"+0x", 3}, {"b14", 3}, {")", 1}, {"[0x", 3}, {"429ed4", 6}, {"]\n", 2}], 
> > 9) = 55
> > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17}, 
> > {"+0x", 3}, {"e6", 2}, {")", 1}, {"[0x", 3}, {"7ffe44562c36", 12}, {"]\n", 
> > 2}], 9) = 57
> > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > {"setlocale", 9}, {"+0x", 3}, {"1f9", 3}, {")", 1}, {"[0x", 3}, {"428cd9", 
> > 6}, {"]\n", 2}], 9) = 60
> > write(2, "======= Memory map: ========\n", 29)                              
> >   = 29
> > open("/proc/self/maps", O_RDONLY)                = 8
> > read(8, "00400000-005b1000 r-xp 00000000 "..., 1024) = 1024
> > write(2, "00400000-005b1000 r-xp 00000000 "..., 1024) = 1024
> > read(8, "           /lib64/libz.so.1.2.7\n"..., 1024) = 1024
> > write(2, "           /lib64/libz.so.1.2.7\n"..., 1024)                      
> >           = 1024
> > read(8, "0.1\n7ffe41e40000-7ffe41e41000 rw"..., 1024) = 1024
> > write(2, "0.1\n7ffe41e40000-7ffe41e41000 rw"..., 1024) = 1024
> > read(8, "03:01 2611268                   "..., 1024) = 1024
> > write(2, "03:01 2611268                   "..., 1024) = 1024
> > read(8, "000 r--p 00014000 103:01 1733331"..., 1024) = 1024
> > write(2, "000 r--p 00014000 103:01 1733331"..., 1024) = 1024
> > read(8, "m_err.so.2.1\n7ffe42ebc000-7ffe42"..., 1024) = 1024
> > write(2, "m_err.so.2.1\n7ffe42ebc000-7ffe42"..., 1024)        = 1024
> > read(8, ".2.7.1\n7ffe43586000-7ffe43587000"..., 1024) = 1024
> > write(2, ".2.7.1\n7ffe43586000-7ffe43587000"..., 1024) = 1024
> > read(8, "\n7ffe439e6000-7ffe439f2000 rw-p "..., 1024) = 1024
> > write(2, "\n7ffe439e6000-7ffe439f2000 rw-p "..., 1024) = 1024
> > read(8, "0000 00:00 0 \n7ffe448bc000-7ffe4"..., 1024) = 1024
> > write(2, "0000 00:00 0 \n7ffe448bc000-7ffe4"..., 1024)                  = 
> > 1024
> > read(8, "ibdl-2.11.3.so\n7ffe44f54000-7ffe"..., 1024) = 1024
> > write(2, "ibdl-2.11.3.so\n7ffe44f54000-7ffe"..., 1024)                      
> >           = 1024
> > read(8, " 00:00 0                        "..., 1024) = 206
> > write(2, " 00:00 0                        "..., 206)        = 206
> > read(8, "", 1024)                      = 0
> > close(8) = 0
> > rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8)                        = 0
> > tgkill(13089, 13089, SIGABRT) = 0
> > --- SIGABRT (Aborted) @ 0 (0) ---
> > ___________________________________________________________________________________________________
> > 
> > I can't interpret this well enough to know what went wrong (if it's even 
> > in there in the first place). The strace output of the five 
> > execd processes running
> 
> There should only be one execd per host. If there is still an old one 
> running, maybe it's best to reboot the machine.
> 
> -- Reuti
> 
> 
> > constantly in the background is too long. The one probably managing the 
> > job didn't look very different between a working and a non-working 
> > submission (first and second job); the first real difference is the last 
> > two lines of this excerpt, the rest is only slightly differing numbers 
> > etc.:
> > ___________________________________________________________________________________________________
> > [...]
> > stat("/proc/6/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > open("/proc/6/status", O_RDONLY) = 8
> > fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
> > 0x7ffe45169000
> > read(8, "Name:\tmigration/0\nState:\tS (slee"..., 1024) = 799
> > close(8) = 0
> > munmap(0x7ffe45169000, 4096)                               = 0
> > close(8) = -1 EBADF (Bad file descriptor)
> > stat("/proc/7/stat", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > open("/proc/7/stat", O_RDONLY)          = 8
> > read(8, "7 (watchdog/0) S 2 0 0 0 -1 2216"..., 1023) = 164
> > close(8) = 0
> > stat("/proc/7/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > open("/proc/7/status", O_RDONLY) = 8
> > fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
> > 0x7ffe45169000
> > read(8, "Name:\twatchdog/0\nState:\tS (sleep"..., 1024) = 800
> > close(8)                                = 0
> > munmap(0x7ffe45169000, 4096)            = 0
> > close(8)                                = -1 EBADF (Bad file descriptor)
> > stat("/proc/8/stat", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > open("/proc/8/stat", O_RDONLY)          = 8
> > read(8, "8 (migration/1) S 2 0 0 0 -1 221"..., 1023) = 164
> > close(8)                                = 0
> > --- SIGCHLD (Child exited) @ 0 (0) ---
> > rt_sigreturn(0x11)                      = 0
> > ___________________________________________________________________________________________________
> > 
> > I'd be happy about any suggestion on how to solve this, or just where I 
> > could continue searching for the root of the problem.
> > Thanks, Sven
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users

                                          