A backtrace should provide enough information to know where to look for issues, and getting one should not take a long time. Maybe you can use monit to monitor the CPU and, on failure, run 'kamctl trap' to get the backtrace:

    if cpu is greater than 50% for 5 cycles then exec "/usr/sbin/kamctl trap"

Make sure that you have the debug rpm installed.
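
For context, a monit rule like the one above has to live inside a process check. A minimal sketch of what that stanza might look like follows; the check name and the pidfile path are assumptions and need to be adjusted to the local installation.

```
# Sketch only: the check name "kamailio" and the pidfile path are
# assumptions, not taken from this thread.
check process kamailio with pidfile /var/run/kamailio.pid
    # Rule as given above: fire after 5 consecutive monitoring cycles.
    if cpu is greater than 50% for 5 cycles then exec "/usr/sbin/kamctl trap"
```

Since the Kamailio workers are forked children of the main process, monit's "total cpu" test (process plus children) may be a better fit than plain "cpu"; that is worth checking against the installed monit version.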

-ovidiu

On Tue, Sep 29, 2015 at 1:40 PM, Alex Balashov <abalas...@evaristesys.com> wrote:
> Hi,
>
> Thanks very much to you and Ovidiu for the responses. I didn't mean to leave
> this thread hanging. See inline:
>
> On 09/28/2015 05:51 PM, Daniel-Constantin Mierla wrote:
>
>> Were you pulling the backtraces based on the script you pasted in your
>> previous email? That should be a good source of information to analyze
>> what kamailio was doing.
>
>
> Yes, although as yet I have not been able to actually get the operator to
> run a backtrace at the time of the deadlock. It's a psychological and
> political problem: they are so eager to restore service that they do not
> have the discipline to run my debug script, and jump straight to restarting
> Kamailio.
>
> However, the biggest problem that I see is that if the backtraces reveal
> something interesting, it may invite follow-up, e.g. examination of other
> frames and values. That would require a core dump. Dumping core for all 8-12
> child processes would take several minutes, as the shm pool is quite large
> (4 GB). This is a very high-volume installation. The operator would never go
> for that.
>
> So, if I do get an intriguing backtrace, I don't really know what else to do
> to elaborate.
>
>> I already said, if there is a mutex deadlock, it will be also noticed by
>> high cpu usage. Was it the case, or you don't have any access to cpu
>> usage history?
>
>
> I don't have CPU usage history, but I will try to get one next time this
> happens.
>
>> If it is just no more sip message routing, but no high cpu usage, then:
>>
>> - maybe processes were blocked in a lengthy I/O operation (e.g., query
>> to database)
>
>
> That's certainly possible. The backtrace will surely reveal that.
>
>> - maybe someone/something was resetting the network interface (the
>> sockets were bound to previous address) -- e.g., it can be done by some
>> upgrades of OS or dhcp
>
>
> No, that definitely is not the case.
>
>> - maybe some limits of OS were reached, the packets were filtered by
>> kernel (if you have centos with selinux, be sure it is properly
>> configured)
>
>
> I am aware of CentOS's ridiculous default ulimits in CentOS 6.6, and all of
> these have been appropriately set to infinity. SELinux is disabled.
>
> I'll let you know what I find. Thanks for the input!
>
> -- Alex
>
> --
> Alex Balashov | Principal | Evariste Systems LLC
> 303 Perimeter Center North, Suite 300
> Atlanta, GA 30346
> United States
>
> Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
> Web: http://www.evaristesys.com/, http://www.csrpswitch.com/
>
> _______________________________________________
> SIP Express Router (SER) and Kamailio (OpenSER) - sr-users mailing list
> sr-users@lists.sip-router.org
> http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-users

--
VoIP Embedded, Inc.
http://www.voipembedded.com