[ https://issues.apache.org/jira/browse/TS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468179#comment-13468179 ]
Alan M. Carroll commented on TS-1487:
-------------------------------------

TS-1487 Fix proposal:

1) Add a new initialization function init_HttpProxyServerSockets which would open all of the sockets without starting threads or listening on the sockets. This can then be called to provide a window between opening the sockets and listening on them for plugin initialization.

2) Add a new eventing mechanism for plugins to catch specific ATS level events. A plugin would make an API call to register a callback continuation which would be invoked for the following events:

PORTS_OPEN : sockets for listen ports are open.
CACHE_RUNNING : cache is now operational.

It is suggested that we may want to expand this to include SHUTDOWN and RECONFIGURE. An alternative would be to have a potentially different callback per event in the style of TSHttpHookAdd, e.g. TSAtsHookAdd(TSAtsHookID, TSCont) ("TSSystemHookAdd"?).

3) Plugins would then be initialized as early as possible, which means calling the TSPluginInit function as early as possible. Plugins that need to perform operations at some later point in the ATS lifecycle (e.g., after sockets are opened) would set a hook during TSPluginInit and perform the operation in the callback.

It should be noted that for sockets we cannot guarantee calling the plugin before the sockets are open, as that may happen even before the traffic_server process is started. We can only promise that when the PORTS_OPEN callback is invoked, the sockets are open.

This provides a very general mechanism which should be relatively straightforward to use and implement, and it avoids a configuration variable (always a feature!). If we find in the future additional lifecycle points at which a plugin needs to perform operations, these can be added in a fully backwards compatible manner. The SPDY plugin would need to be updated, but that is AFAIK the only plugin that currently is dependent on this ordering.
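To make (2) and (3) a bit more concrete, here is a rough, self-contained sketch of how the hook registration and startup ordering could fit together. It does not use the real ATS headers: lifecycle_hook_add stands in for the suggested TSAtsHookAdd, a plain function pointer stands in for a TSCont, and every name in it (LifecycleEvent, run_startup_sequence, the trace helpers) is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lifecycle events, named after the two in the proposal. */
typedef enum {
  LIFECYCLE_PORTS_OPEN = 0, /* listen sockets are open, not yet accepting */
  LIFECYCLE_CACHE_RUNNING,  /* cache is operational */
  LIFECYCLE_EVENT_COUNT
} LifecycleEvent;

/* Stand-in for a TSCont: a callback plus opaque user data (assumption). */
typedef void (*LifecycleCallback)(LifecycleEvent event, void *data);

#define MAX_HOOKS_PER_EVENT 8
#define TRACE_MAX 16

typedef struct {
  LifecycleCallback fn;
  void *data;
} Hook;

static Hook hooks[LIFECYCLE_EVENT_COUNT][MAX_HOOKS_PER_EVENT];
static size_t hook_count[LIFECYCLE_EVENT_COUNT];

/* Ordered record of startup steps, so the sequence can be inspected. */
static const char *trace[TRACE_MAX];
static size_t trace_len;

static void record(const char *step)
{
  if (trace_len < TRACE_MAX)
    trace[trace_len++] = step;
}

/* Stand-in for the suggested TSAtsHookAdd(); returns 0 on success, -1 on error. */
int lifecycle_hook_add(LifecycleEvent event, LifecycleCallback fn, void *data)
{
  if ((int)event < 0 || event >= LIFECYCLE_EVENT_COUNT || fn == NULL)
    return -1;
  if (hook_count[event] >= MAX_HOOKS_PER_EVENT)
    return -1;
  hooks[event][hook_count[event]].fn = fn;
  hooks[event][hook_count[event]].data = data;
  hook_count[event]++;
  return 0;
}

/* Invoke every callback registered for the event, in registration order. */
static void lifecycle_dispatch(LifecycleEvent event)
{
  for (size_t i = 0; i < hook_count[event]; i++)
    hooks[event][i].fn(event, hooks[event][i].data);
}

/* Example plugin: work that needs open sockets is deferred to the callback. */
static void example_on_ports_open(LifecycleEvent event, void *data)
{
  (void)event;
  (void)data;
  record("plugin: PORTS_OPEN callback");
}

static void example_plugin_init(void)
{
  record("plugin: init");
  int rc = lifecycle_hook_add(LIFECYCLE_PORTS_OPEN, example_on_ports_open, NULL);
  assert(rc == 0);
  (void)rc;
}

/* Simulated startup order; returns the number of recorded steps. */
size_t run_startup_sequence(void)
{
  trace_len = 0;
  for (size_t i = 0; i < LIFECYCLE_EVENT_COUNT; i++)
    hook_count[i] = 0;

  example_plugin_init();                    /* (3) plugins as early as possible */
  record("core: sockets opened");           /* (1) open sockets without listening */
  lifecycle_dispatch(LIFECYCLE_PORTS_OPEN); /* (2) notify registered plugins */
  record("core: cache running");
  lifecycle_dispatch(LIFECYCLE_CACHE_RUNNING);
  record("core: listening");                /* accept threads start last here */
  return trace_len;
}

/* Position of a step in the trace, or -1 if it never happened. */
int startup_step_index(const char *step)
{
  for (size_t i = 0; i < trace_len; i++)
    if (strcmp(trace[i], step) == 0)
      return (int)i;
  return -1;
}
```

Calling run_startup_sequence() and then startup_step_index() on the step names shows the plugin callback running after the sockets are opened and before listening starts. The sketch puts cache startup before listening only to have some fixed order; it is not meant to settle how those two should be ordered.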
This would represent a change from 3.2 but a reversion to the 3.0.X behavior with regard to when plugins are initialized, which I think is acceptable.

We would still need an additional configuration variable to control the ordering of listen thread startup and cache readiness. With the socket opening split off as per (1) this would be relatively easy to implement. The primary question would be whether main() should just call start_HttpProxyServer and pass an event code, or check the event code itself and conditionally call it.

> the ordering of plugin_init and init_HttpProxyServer cause crashed TS to core endlessly
> ---------------------------------------------------------------------------------------
>
>                 Key: TS-1487
>                 URL: https://issues.apache.org/jira/browse/TS-1487
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 3.2.0
>         Environment: Linux RHEL6.2
>            Reporter: Aidan McGurn
>            Assignee: Alan M. Carroll
>            Priority: Critical
>         Attachments: INTD-529-RespawnCrash.patch, INTD-529-RespawnCrash.patch
>
>
> We've had a serious issue whereby the TS when it crashes re-spawns/cores
> continuously when it tries to re-start under load. I traced the issue to
> SNMP research library (a third party lib) - They use selects and what happens
> is the file descriptor number spikes under load after the crash as all the
> sockets get opened at once - this causes buffer overflow in the select (which
> their library is full of) as the fd allocated to the FD_SET is much bigger
> than the FD_SETSIZE of 1024 (which was a bitch to track down as the stack
> was corrupted and gdb therefore useless). Tracing why this happened on 3.2.0
> and not 3.0.2, I find the sequence of the plugin_init has changed - On 3.0.2
> the sequence was in effect 1. plugin_init and then 2. init_HttpProxyServer.
> Whereas this has mysteriously been reversed on 3.2.0. In order to get our
> system to work in this crash case, I've patched ATS to flip them around like
> in 3.0.2.
> I'll attach the patch we propose we need to use to get around this.
> Is this actually a bug then waiting to happen in other systems - Or was there
> a reason to change this sequence?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira