Hi,

Here is what I currently have:

# pwd
/dev/shm
# ls -la
total 4
drwxrwxrwx 2 root      root        60 Oct  6 21:07 .
drwxr-xr-x 9 root      root      2180 Oct  2 22:28 ..
-rw-r--r-- 1 trafodion trafodion   32 Oct  6 21:07 sem.monitor.sem.trafodion

kernel.shmmax = 68719476736
kernel.shmall = 4294967296

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1805076
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
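
To compare the values above with the advice in the quoted replies below (shmmax of at least 64 MB and preferably 128 MB, a high max locked memory limit, /dev/shm writable by the trafodion user), something like the following could be run on each node. It is only a minimal Python sketch: the 64 MB and 128 MB thresholds come from Selva's note, while the function names and status strings are made up for illustration and are not part of Trafodion.

```python
import os
import resource

MB = 1024 * 1024
RMS_DEFAULT_SEGMENT = 64 * MB    # default RMS shared segment size (from Selva's note)
RMS_EXPANDED_SEGMENT = 128 * MB  # size the segment can be expanded to (same note)


def shmmax_status(shmmax_bytes):
    """Classify a kernel.shmmax value (hypothetical helper, labels are my own)."""
    if shmmax_bytes < RMS_DEFAULT_SEGMENT:
        return "too small"
    if shmmax_bytes < RMS_EXPANDED_SEGMENT:
        return "ok for default, too small for expansion"
    return "ok"


def memlock_limit_kb():
    """Soft 'max locked memory' limit in KB, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft // 1024


def report():
    """Print the three checks for the current user on the current node."""
    try:
        with open("/proc/sys/kernel/shmmax") as f:
            print("shmmax:", shmmax_status(int(f.read().split()[0])))
    except (OSError, ValueError, IndexError):
        print("shmmax: could not be read on this system")
    print("max locked memory (KB):", memlock_limit_kb())
    print("/dev/shm writable:", os.access("/dev/shm", os.W_OK))


if __name__ == "__main__":
    report()
```

It would need to be run as the trafodion user, since both the ulimit and the /dev/shm permission check depend on the account running it.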

I would try reinstalling Trafodion to see if something got corrupted, and
maybe that would fix the issue, but I know there was a crash on sqstart that
one of your guys fixed by copying a rebuilt lib file to our cluster.

This is Narendra's response from a previous thread, where the issue that
kept Trafodion from starting was fixed:


> *I updated the code: sql/cli/memmonitor.cpp, so that if /proc/meminfo does
> not have the ‘Committed_AS’ entry, it will ignore it. Built it and put the
> binary: libcli.so on the veracity box (in the $MY_SQROOT/export/lib64
> directory – on all the nodes). Restarted the env and ‘sqlci’ worked fine.
> Was able to ‘initialize trafodion’ and create a table.*
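
To illustrate the idea behind that fix (this is not the actual memmonitor.cpp change, which I have not seen), here is a minimal Python sketch of reading /proc/meminfo-style text so that a missing 'Committed_AS' entry is tolerated instead of treated as an error. The function names and the sample text are made up for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def committed_as_kb(text):
    """Return Committed_AS in kB, or None when the entry is absent."""
    return parse_meminfo(text).get("Committed_AS")


if __name__ == "__main__":
    # Made-up sample data, not taken from our cluster.
    with_entry = "MemTotal: 16384 kB\nCommitted_AS: 2048 kB\n"
    without_entry = "MemTotal: 16384 kB\nMemFree: 8192 kB\n"
    print(committed_as_kb(with_entry))     # 2048
    print(committed_as_kb(without_entry))  # None
```

A caller would then skip the Committed_AS-based calculation whenever None comes back.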


There was another, similar issue, which I see is closed:
https://issues.apache.org/jira/browse/TRAFODION-1492

So my question is: are these fixes in the latest daily build, so that I can
simply reinstall? If not, please send the changed files so I can overwrite
them after reinstalling.

On Wed, Oct 7, 2015 at 6:02 PM, Selva Govindarajan <
[email protected]> wrote:

> You would want to retain the shared segment size across reboots. So, please
> check if the following settings are available in /etc/sysctl.conf
>
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 134217728
>
> # Controls the maximum total amount of shared memory, in pages
> kernel.shmall = 4294967296
>
>
> shmmax needs to be at least 64 MB. By default, the Trafodion RMS shared
> segment size is 64 MB, and it can be expanded to 128 MB. So it is better
> to set shmmax to 128 MB, in case we need to expand the segment later.
>
> Selva
>
> -----Original Message-----
> From: Prashanth Vasudev [mailto:[email protected]]
> Sent: Tuesday, October 6, 2015 2:19 PM
> To: [email protected]
> Subject: RE: trafodion won't start core files are generated
>
> Hi,
> From the stack trace below, it appears the Trafodion monitor is unable to
> create shared memory objects.
> Please make sure the ulimit settings on all nodes have high limits for max
> locked memory.
> Also make sure /dev/shm on all nodes has the correct write permissions for
> the trafodion user ID.
>
> Regards,
> Prashanth
>
> -----Original Message-----
> From: Radu Marias [mailto:[email protected]]
> Sent: Tuesday, October 6, 2015 9:21 AM
> To: dev <[email protected]>
> Subject: trafodion won't start core files are generated
>
> Hi,
>
> At some point a node in the 5-node cluster stopped and we needed to
> restart it. After that I restarted all the Ambari and HDP services, but
> Trafodion fails to start.
>
> Below are some stack traces, plus details for the core files for which I'm
> not getting any stack. The files are from node1 and node2, dated Oct  2
> (when I think node 2 was down) and Oct  6 (when we rebooted the node and
> tried to start Trafodion). Feel free to connect and debug the issue on our
> cluster; Amanda has the credentials.
>
> *FROM NODE1*
>
> Oct  2 22:27 core.39347
> core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0 188.138.61.175:60186 00002 00000 00009 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39347
> no stack
>
> Oct  2 22:41 core.15144
> Program terminated with signal 6, Aborted.
> #0  0x00007f77bcbbb625 in ?? ()
> #1  0x00007f77bcbbce05 in ?? ()
> #2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
> #3  0x00007f77bee62130 in ?? ()
> #4  0x00007ffe8e796ec0 in ?? ()
> #5  0x00007f77bdeced00 in ?? ()
> #6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
> #7  0x0000000001b3a310 in ?? ()
> #8  0x0000000000000000 in ?? ()
>
> Oct  2 22:41 core.39240
> #0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
> #1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
> #2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
> #4  0x000000000046e213 in CExtTmLeaderReq::performRequest (this=0x7f53340008c0) at reqtmleader.cxx:126
> #5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value optimized out>) at reqworker.cxx:79
> #6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at reqworker.cxx:147
> #7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
> #8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6
>
> Oct  2 22:41 core.15309
> core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 015309 $TM0 188.138.61.175:60186 00002 00000 00134 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.15309
> no stack
>
>
> *FROM NODE2*
>
> Oct  2 22:29 core.39491
> core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00001 00001 039491 $TM1 188.138.61.177:38680 00002 00001 00003 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39491
> no stack
>
> Oct  6 15:23 core.1394
> Program terminated with signal 6, Aborted.
> #0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
> #1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
> #2  0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x2071880, nodeContainer=<value optimized out>) at process.cxx:3366
> #3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448 "euve79672", pnid=0, rank=0) at pnode.cxx:153
> #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
> #5  0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x20757b0) at cluster.cxx:2740
> #6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at cluster.cxx:567
> #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x20757b0) at tmsync.cxx:137
> #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0, procTermSig=9) at monitor.cxx:323
> #9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at monitor.cxx:1152
>
> Oct  6 15:43 core.17626
> Program terminated with signal 6, Aborted.
> #0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
> #1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
> #2  0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x1182890, nodeContainer=<value optimized out>) at process.cxx:3366
> #3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458 "euve79672", pnid=0, rank=0) at pnode.cxx:153
> #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
> #5  0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x11867c0) at cluster.cxx:2740
> #6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at cluster.cxx:567
> #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x11867c0) at tmsync.cxx:137
> #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0, procTermSig=9) at monitor.cxx:323
> #9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at monitor.cxx:1152
>
> --
> And in the end, it's not the years in your life that count. It's the life
> in your years.
>



-- 
And in the end, it's not the years in your life that count. It's the life
in your years.
