Hi all,
I am working on a fix for TS-4475, which is a core dump in
LogCollationClientSM.cc. I gather that there is not a lot of interest in the
Log Collation client, so I am reaching out to the mailing list, especially to
folks familiar with the inactivity cop.
Background: The core dump occurs when VC_EVENT_INACTIVITY_TIMEOUT (event 105)
is sent to several of the state machine's event handlers. This event is not
handled in their switch(event), so it falls through to the default case and
core dumps due to the following code -
    default:
      ink_assert(!"unexpcted state");
      return EVENT_CONT;
We started seeing this inactivity event after we changed the default net
inactivity timeout from 86400s to 300s (in order to use inactivity_cop to help
get rid of hanging connections under overloaded traffic conditions). We are
using ATS as a log collation client, i.e., we have defined log collation hosts
in logs_xml.config.
I have inserted the following code into the various event handlers to "ignore"
the time-out event -
+    case VC_EVENT_INACTIVITY_TIMEOUT:
+      Note("[LogCollationClientSM] - ignoring VC_EVENT_INACTIVITY_TIMEOUT");
+      return EVENT_CONT;
     case VC_EVENT_EOS:
This seems to work (generating a NOTE in diags.log every five minutes), but I
am concerned that NetHandler::manage_active_queue() might still try to close
the VC even if I ignore the time-out event in the event handler. Is that the
case?
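To make the change easier to discuss outside of the ATS tree, here is a
minimal standalone sketch of the patched switch. It is not the actual
LogCollationClientSM code: handle_event() and ignored_timeouts are hypothetical
names, assert() stands in for ink_assert(), printf() stands in for Note(), and
the numeric values of EVENT_CONT and VC_EVENT_EOS are assumptions (only 105 for
the inactivity timeout comes from the core dump).

```cpp
#include <cassert>
#include <cstdio>

// Stand-in values for illustration; the real definitions live in the ATS
// headers. 105 is the value seen in the core dump; the others are assumed.
constexpr int EVENT_CONT                  = 0;   // assumed value
constexpr int VC_EVENT_EOS                = 104; // assumed value
constexpr int VC_EVENT_INACTIVITY_TIMEOUT = 105;

// Hypothetical counter so the effect of the handler can be observed.
static int ignored_timeouts = 0;

// Hypothetical, simplified handler modelled on the switch(event) pattern.
int
handle_event(int event)
{
  switch (event) {
  case VC_EVENT_INACTIVITY_TIMEOUT:
    // Handled explicitly: log and continue instead of reaching default.
    std::printf("[LogCollationClientSM] - ignoring VC_EVENT_INACTIVITY_TIMEOUT\n");
    ++ignored_timeouts;
    return EVENT_CONT;
  case VC_EVENT_EOS:
    // Placeholder for the existing EOS handling.
    return EVENT_CONT;
  default:
    // Stand-in for ink_assert(): aborts when assertions are enabled.
    assert(!"unexpected state");
    return EVENT_CONT;
  }
}
```

With the extra case in place, event 105 returns EVENT_CONT and never reaches
the asserting default branch. The sketch says nothing about what the net
handler does with the VC afterwards, which is the open question here.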
Example from diags.log -
[Jun 24 00:39:35.031] Server {0x2b387ec7c700} NOTE: [LogCollationClientSM] - ignoring VC_EVENT_INACTIVITY_TIMEOUT
Debug from traffic.out -
[Jun 24 00:39:35.031] Server {0x2b387ec7c700} DEBUG: (inactivity_cop_verbose) vc: 0x2b38c0018d00 now: 1466728775031910904 timeout at: 1466728774 timeout in: 300
[Jun 24 00:39:35.031] Server {0x2b387ec7c700} DEBUG: (inactivity_cop_verbose) vc: 0x2b38c0018d00 now: 1466728775031910904 timeout at: 1466728774291725549 timeout in: 300000000000
[Jun 24 00:39:37.024] Server {0x2b387ec7c700} DEBUG: (inactivity_cop_verbose) vc: 0x2b38c0018d00 now: 1466728777024229232 timeout at: 1466729076 timeout in: 300
It seems like the net effect is just to push the time-out out by another ~300s
each time it fires. Note that the values in the debug messages appear to switch
units (seconds vs. nanoseconds) at the transition point.
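As a sanity check on the units, the following small sketch converts the
nanosecond values from the debug output to seconds. ns_to_seconds() and
NS_PER_SECOND are hypothetical names used only for this illustration; they are
not taken from the ATS source.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t NS_PER_SECOND = 1000000000LL;

// Convert a nanosecond value to whole seconds (truncating).
int64_t
ns_to_seconds(int64_t ns)
{
  assert(ns >= 0); // the values in the debug output are all non-negative
  return ns / NS_PER_SECOND;
}
```

Applied to the output above, 1466728774291725549 ns is 1466728774 s and
300000000000 ns is 300 s, so the two 00:39:35 lines carry the same values in
different units. In the 00:39:37 line, "now" is 1466728777 s and the new
deadline is 1466729076, i.e. 299 s later, which matches the ~300s extension.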
I would appreciate any recommendations or comments on this solution.
Thanks,
Peter