[ 
https://issues.apache.org/jira/browse/TS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468179#comment-13468179
 ] 

Alan M. Carroll commented on TS-1487:
-------------------------------------

TS-1487

Fix proposal:

1) Add a new initialization function init_HttpProxyServerSockets which would 
open all of the sockets without starting threads or listening on the sockets. 
This can then be called to provide a window between opening the sockets and 
listening on them for plugin initialization.

2) Add a new eventing mechanism for plugins to catch specific ATS level events. 
A plugin would make an API call to register a callback continuation which would 
be invoked for the following events: 

  PORTS_OPEN : sockets for listen ports are open.
  CACHE_RUNNING : cache is now operational
  
It is suggested that we may want to expand this to include SHUTDOWN and 
RECONFIGURE. An alternative would be to have a potentially different callback 
per event in the style of TSHttpHookAdd, e.g. TSAtsHookAdd(TSAtsHookID, TSCont) 
(or perhaps "TSSystemHookAdd").

3) Plugins would then be initialized as early as possible, which means calling 
the TSPluginInit function as early as possible. Plugins that need to perform 
operations at some later point in the ATS lifecycle (e.g., after sockets are 
opened) would set a hook during TSPluginInit and perform the operation in the 
callback. It should be noted that for sockets we cannot guarantee calling the 
plugin before the sockets are open as that may happen even before the 
traffic_server process is started. We can only promise that when the 
PORTS_OPEN callback is invoked, the sockets are open.
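
A minimal sketch of how (2) and (3) could fit together, written as standalone 
C rather than against the real plugin API. Every name here (LifecycleEvent, 
lifecycle_hook_add, lifecycle_fire, example_plugin_init) is hypothetical and 
only illustrates the shape of the proposal: a plain function pointer stands in 
for a TSCont, and example_plugin_init stands in for TSPluginInit.

```c
#include <stddef.h>

/* Hypothetical event IDs; only these two events are named in the proposal. */
typedef enum {
    LIFECYCLE_PORTS_OPEN,    /* sockets for listen ports are open */
    LIFECYCLE_CACHE_RUNNING, /* cache is now operational */
    LIFECYCLE_EVENT_COUNT
} LifecycleEvent;

/* A function pointer plus user data stands in for a continuation. */
typedef void (*LifecycleCallback)(LifecycleEvent event, void *data);

#define MAX_HOOKS_PER_EVENT 8

typedef struct {
    LifecycleCallback cb;
    void *data;
} LifecycleHook;

static LifecycleHook hooks[LIFECYCLE_EVENT_COUNT][MAX_HOOKS_PER_EVENT];
static size_t hook_count[LIFECYCLE_EVENT_COUNT];

/* Register cb for one event. Returns 0 on success, -1 if the arguments are
 * invalid or the table for that event is full. */
int lifecycle_hook_add(LifecycleEvent event, LifecycleCallback cb, void *data)
{
    if ((int)event < 0 || event >= LIFECYCLE_EVENT_COUNT || cb == NULL)
        return -1;
    if (hook_count[event] == MAX_HOOKS_PER_EVENT)
        return -1;
    hooks[event][hook_count[event]].cb = cb;
    hooks[event][hook_count[event]].data = data;
    hook_count[event]++;
    return 0;
}

/* Called by the core when a lifecycle point is reached. Invokes the hooks
 * in registration order and returns how many were invoked. */
size_t lifecycle_fire(LifecycleEvent event)
{
    if ((int)event < 0 || event >= LIFECYCLE_EVENT_COUNT)
        return 0;
    for (size_t i = 0; i < hook_count[event]; i++)
        hooks[event][i].cb(event, hooks[event][i].data);
    return hook_count[event];
}

/* Plugin side: the work that must wait until the sockets are open. */
void on_ports_open(LifecycleEvent event, void *data)
{
    (void)event;
    *(int *)data = 1; /* mark "ports are ready" for the plugin */
}

/* Stand-in for TSPluginInit: runs early and only registers the hook. */
void example_plugin_init(int *ports_ready)
{
    *ports_ready = 0;
    lifecycle_hook_add(LIFECYCLE_PORTS_OPEN, on_ports_open, ports_ready);
}
```

In this sketch the core would call example_plugin_init first and 
lifecycle_fire(LIFECYCLE_PORTS_OPEN) only once the sockets are known to be 
open. Adding SHUTDOWN or RECONFIGURE later would just mean new enum values, 
which is what keeps the scheme backwards compatible.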

This provides a very general mechanism which should be relatively 
straightforward to use and implement and avoids a configuration variable 
(always a feature!). If we find in the future additional lifecycle points at 
which a plugin needs to perform operations these can be added in a fully 
backwards compatible manner. The SPDY plugin would need to be updated but that 
is AFAIK the only plugin that currently is dependent on this ordering. This 
would represent a change from 3.2 but a reversion to the 3.0.X behavior with 
regard to when plugins are initialized which I think is acceptable.

We would still need an additional configuration variable to control the 
ordering of listen thread startup and cache readiness. With the socket opening 
split off as per (1) this would be relatively easy to implement. The primary 
question would be whether main() should just call start_HttpProxyServer and 
pass an event code, or check the event code itself and conditionally call it.
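
A rough sketch of the first option, assuming plain POSIX sockets. The names 
open_proxy_socket, start_proxy_server, is_listening and StartupEvent are 
hypothetical, the trigger argument stands in for the proposed configuration 
variable, and binding to loopback is only to keep the example harmless. 
The socket is created and bound up front, and listen() is only called when the 
event passed in matches the configured trigger.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical lifecycle points at which listening could be started. */
typedef enum { EVENT_PORTS_OPEN, EVENT_CACHE_RUNNING } StartupEvent;

/* Phase 1: create and bind a TCP socket without calling listen().
 * Port 0 asks the kernel for an ephemeral port. Returns the fd or -1. */
int open_proxy_socket(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Phase 2: main() calls this at every lifecycle point and passes the event.
 * Returns 1 if this call started listening, 0 if the event is not the
 * configured trigger, -1 if listen() failed. */
int start_proxy_server(int fd, StartupEvent event, StartupEvent trigger)
{
    if (event != trigger)
        return 0;
    return listen(fd, 128) == 0 ? 1 : -1;
}

/* Returns 1 if listen() has been called on fd, 0 if not, -1 on error. */
int is_listening(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

With this shape main() stays simple and the decision lives in one place. The 
other option would move the event comparison into main() and leave the start 
function unconditional.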
                
> the ordering of plugin_init and init_HttpProxyServer cause crashed TS to core 
> endlessly
> ---------------------------------------------------------------------------------------
>
>                 Key: TS-1487
>                 URL: https://issues.apache.org/jira/browse/TS-1487
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 3.2.0
>         Environment: Linux RHEL6.2
>            Reporter: Aidan McGurn
>            Assignee: Alan M. Carroll
>            Priority: Critical
>         Attachments: INTD-529-RespawnCrash.patch, INTD-529-RespawnCrash.patch
>
>
> We've had a serious issue whereby the TS when it crashes re-spawns/cores 
> continuously when it tries to re-start under load. I traced the issue to 
> SNMP research library (a third party lib)- They use selects and what happens 
> is the file descriptor number spikes under load after the crash as all the 
> sockets get opened at once - this causes buffer overflow in the select (which 
> their library is full of) as the fd allocated to the FD_SET is much bigger 
> than the FD_SETSIZE of 1024 (which was a bitch to track down as the stack 
> was corrupted and gdb therefore useless). Tracing why this happened on 3.2.0 
> and not 3.0.2, I find the sequence of the plugin_init has changed - On 3.0.2 
> the sequence was in effect 1. plugin_init and then 2. init_HttpProxyServer. 
> Whereas this has mysteriously been reversed on 3.2.0. In order to get our 
> system to work in this crash case, I've patched ATS to flip them around like 
> in 3.0.2.
> I'll attach the patch we propose we need to use to get around this.
> Is this actually a bug then waiting to happen in other systems - Or was there 
> a reason to change this sequence?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
