First, I changed the subject line so that this message doesn't get filtered
out. The Trafodion master daily build has been failing intermittently with the
following stack trace in the monitor.


(gdb) bt
#0  0x00007feaee0eb625 in raise () from /lib64/libc.so.6
#1  0x00007feaee0ece05 in abort () from /lib64/libc.so.6
#2  0x000000000041f8b3 in CProcessContainer::CProcessContainer (this=0x270e340, 
nodeContainer=<value optimized out>) at process.cxx:3389
#3  0x00000000004569cc in CNode::CNode (this=0x270e340, name=0x26e9548 
"slave-ahw23", pnid=0, rank=0) at pnode.cxx:152
#4  0x0000000000458050 in CNodeContainer::AddNodes (this=<value optimized out>) 
at pnode.cxx:1572
#5  0x0000000000419185 in CCluster::InitializeConfigCluster (this=0x2712270) at 
cluster.cxx:2818
#6  0x0000000000419e25 in CCluster::CCluster (this=0x2712270) at cluster.cxx:597
#7  0x000000000043473a in CTmSync_Container::CTmSync_Container (this=0x2712270) 
at tmsync.cxx:137
#8  0x0000000000408f36 in CMonitor::CMonitor (this=0x2712270, procTermSig=9) at 
monitor.cxx:329
#9  0x000000000040a5ab in main (argc=2, argv=0x7ffd157c0b48) at monitor.cxx:1308
(gdb) f 2

The monitor log shows:
2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process Name: 
$MONITOR,,, TID: 17918, Message ID: 101020103, [CMonitor::main], monitor 
Version 1.0.1 prodver Release 2.2.0 (Build release 
[2.0.1rc3-1425-g6155ff1_Bld883], branch 6155ff1_no_branch, date 20170316_0832), 
Started! CommType: Sockets
2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process Name: 
$MONITOR,,, TID: 17918, Message ID: 101010401, [CCluster::CCluster] Validation 
of node down is disabled
2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process 
Name: $MONITOR,,, TID: 17918, Message ID: 101030703, 
[CProcessContainer::CProcessContainer], Can't create semaphore 
/monitor.sem.trafodion! (Permission denied)
2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process 
Name: $MONITOR,,, TID: 17918, Message ID: 101030704, 
[CProcessContainer::CProcessContainer], Can't unlink semaphore 
/monitor.sem.trafodion! (Permission denied)

I came up with the following theory.

When a semaphore is created, the process creates a file for it under /dev/shm,
named after the semaphore (sem.<name>). The process owner needs write
permission to create this file. Initially I suspected a permission issue with
the /dev/shm directory.
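
To make the mechanism concrete, here is a minimal sketch of opening a named
semaphore and reporting the errno on failure. It is not the monitor's actual
code: the function names, the 0644 mode and the initial value of 1 are my
assumptions, and the /dev/shm/sem.<name> mapping is what Linux with glibc does.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>      // O_CREAT
#include <semaphore.h>  // sem_open, sem_close, sem_unlink
#include <string>

// On Linux with glibc, the named semaphore "/foo" is backed by the file
// /dev/shm/sem.foo. (Hypothetical helper, for illustration only.)
std::string backing_file_for(const std::string& sem_name) {
    assert(!sem_name.empty());
    // Drop the leading '/' of the semaphore name, if present.
    std::string base = (sem_name[0] == '/') ? sem_name.substr(1) : sem_name;
    return "/dev/shm/sem." + base;
}

// Opens the semaphore, creating it if it does not exist yet.
// Returns 0 on success, otherwise the errno set by sem_open. EACCES means
// the semaphore already exists and the caller may not open it.
int open_or_create_semaphore(const std::string& sem_name) {
    sem_t* sem = sem_open(sem_name.c_str(), O_CREAT, 0644, 1);
    if (sem == SEM_FAILED) {
        int err = errno;  // save it before another call can overwrite it
        std::fprintf(stderr, "Can't open semaphore %s (file %s): %s (errno %d)\n",
                     sem_name.c_str(), backing_file_for(sem_name).c_str(),
                     std::strerror(err), err);
        return err;
    }
    sem_close(sem);  // the name stays in /dev/shm until sem_unlink is called
    return 0;
}
```

A run that exits without calling sem_unlink leaves the file behind, and the
next run opens that existing file instead of creating a fresh one.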

I just looked at /dev/shm in the Jenkins VM, and it does have write permission.

If that's the case, it is possible that the previous semaphore was not cleaned
up correctly. The monitor seems to create the semaphore as
/monitor.sem.<user_name>, which shows up as
/dev/shm/sem.monitor.sem.<user_name>. If trafodion gets a different uid between
two runs, the new run may be unable to open or unlink the semaphore left behind
by the old one. In the case of RMS, we attach the port number to the semaphore
name so that every run from the same user name gets a different semaphore name.
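
As a rough sketch of that naming idea applied to the monitor: the helper below
is hypothetical, and the exact layout of the name (user, then port, separated
by dots) is my assumption, not the format RMS actually uses.

```cpp
#include <cassert>
#include <string>

// Builds a per-run semaphore name by appending the port number to the
// per-user name. (Hypothetical helper; the name layout is an assumption.)
std::string make_semaphore_name(const std::string& user, int port) {
    assert(!user.empty() && port > 0);
    return "/monitor.sem." + user + "." + std::to_string(port);
}
```

With this, make_semaphore_name("trafodion", 23400) and
make_semaphore_name("trafodion", 23401) give different names (the port values
are made up), so a stale semaphore from one run would not block the next run.
The stale file would still need to be removed separately.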

---------------------

The sem_open documentation says:

EACCES The semaphore exists, but the caller does not have permission
              to open it

EACCES is 13, the errno value seen in gdb.
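
For anyone who wants to double-check the number, here is a small sketch that
maps an errno value to its symbolic name and message. The function names are
hypothetical; the list covers the errors in the sem_open man page, and the
value 13 for EACCES is the Linux one.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Symbolic name for the errno values that sem_open is documented to return.
const char* sem_open_errno_name(int err) {
    switch (err) {
        case EACCES:       return "EACCES";
        case EEXIST:       return "EEXIST";
        case EINVAL:       return "EINVAL";
        case EMFILE:       return "EMFILE";
        case ENAMETOOLONG: return "ENAMETOOLONG";
        case ENFILE:       return "ENFILE";
        case ENOENT:       return "ENOENT";
        case ENOMEM:       return "ENOMEM";
        default:           return "unknown";
    }
}

// Formats an errno value as "NAME (number): message".
std::string describe_errno(int err) {
    assert(err > 0);
    return std::string(sem_open_errno_name(err)) + " (" + std::to_string(err) +
           "): " + std::strerror(err);
}
```

Calling describe_errno(13) on the build machine should name EACCES, which
matches the "Permission denied" text in the monitor log.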

Please help resolve this issue if you have any other ideas.

Selva
