On 7/8/16, 9:19 AM, "James Peach" <[email protected]> wrote:
>
>> On Jul 7, 2016, at 1:16 PM, Shrihari Kalkar <[email protected]> wrote:
>>
>> James,
>> Thank you for the clarifications. Please find my responses inline.
>>
>> On 7/7/16, 11:56 AM, "James Peach" <[email protected]> wrote:
>>
>>>
>>>> On Jul 7, 2016, at 9:55 AM, Shrihari Kalkar <[email protected]> wrote:
>>>>
>>>> Guys,
>>>> I understand that in code, if a stat is registered under
>>>> 'proxy.process', it is owned by traffic_server.
>>>
>>> That is by convention. In the code, RECT_PROCESS metrics are owned by
>>> traffic_server. Ownership of metrics is not a particularly strong
>>> concept though :-/
>>
>> Cool!
>>
>>>> Also, if a stat is registered in stats.config.xml or metrics.config,
>>>> the 'destination' is owned by traffic_manager.
>>>>
>>>> Since proxy.process.ssl.total_success_handshake_count is listed in
>>>> stats.config.xml and metrics.config, do you think that this is a
>>>> conflict of ownership?
>>>
>>> Grovelling through the git history will probably clear this up for you.
>>> IIRC total_success_handshake_count was moved to the derived metrics for
>>> compatibility when total_success_handshake_count_in and
>>> total_success_handshake_count_out were introduced. Maybe it makes sense
>>> now to make total_success_handshake_count the sum of in and out?
>>
>> Yes, I went through https://issues.apache.org/jira/browse/TS-3409 to
>> understand why this was added. I agree that it should be a derived metric,
>> but I found it odd that it was the only derived metric that had a
>> destination set to 'proxy.process'.
>>
>>>> I see that on my instance, traffic_manager keeps sending an 'RECG_SET'
>>>> event for this stat to traffic_server even when nothing is happening. As
>>>> a result, I am seeing an issue where traffic_server blocks for a very
>>>> long time during initialization.
>>>
>>> Hmm. I'm not sure that slow initialization would be related to this. One
>>> common cause of initialization lag is setting
>>> proxy.config.http.wait_for_cache. I think you need to debug this one
>>> further :)
>>
>> I did some analysis there and here is what I found out. Hopefully I am not
>> too far off on the wrong path.
>>
>> As part of startup, traffic_server creates two threads (to begin with):
>> 1. The main thread (Thread 1) blocks until it is signaled by another thread.
>
> Where do you see this happening?

I see this happening in the following code path:
Main() -> initialize_process_manager() -> RecProcessInitMessage():

    if (mode_type == RECM_CLIENT) {
      send_pull_message(RECG_PULL_REQ);
      g_force_req_notify.lock();
      g_force_req_notify.wait();      <--- waits for a signal from another thread
      g_force_req_notify.unlock();
    }

>
>> 2. Thread 2 polls for messages from traffic_manager.
>>
>> traffic_server is waiting for a message from traffic_manager which
>> contains all the configuration required for it to go ahead with
>> initialization. Hence, I understand that it is critical that the main
>> thread (Thread 1) waits until it gets the configuration.
>>
>> Thread 2, which polls for messages from traffic_manager, works like this:
>>
>>   for (;;) {
>>     if (pmgmt->require_lm) {        <--- always true (for us)
>>       pmgmt->pollLMConnection();    <--- for (count = 0; count < 10000; count++)
>>                                             num = mgmt_read_timeout(...)  <--- blocking call; returns 0 if
>>                                                                                nothing was received for 1 second
>>                                             if !num: break                <--- break out of the loop and
>>                                                                                return from the function
>>                                             else: read(fd), add_to_event_queue,
>>                                                   continue the loop       <--- back to fetching another message
>>     }
>>     pmgmt->processEventQueue();     <--- process the messages received in pollLMConnection()
>>     pmgmt->processSignalQueue();
>>     mgmt_sleep_sec(pmgmt->timeout);
>>   }
>>
>> I found two possible problems here:
>>
>> 1. When I looked into the above code, I observed that pollLMConnection
>> might not return for a very long time if it keeps getting messages. As a
>> result, we may not call processEventQueue() (until we receive 10k
>> messages), which is what processes the received messages. And unless we
>> process the messages, we cannot signal the main thread (Thread 1), which
>> is still blocked, to continue. Hence I saw the issue where traffic_server
>> won't complete initialization for a very long time.
>
> Yes, this could definitely be improved. I'd be happy to help shepherd a
> patch if this is something you would like to work on.

Sure, I have a patch in mind that I am testing. Let me do some more testing
on it and then I will reach out to you to understand the patch process. (A
rough sketch of the idea is further down in this mail, after the quoted
discussion.)

>
>> 2. The second problem I wanted to figure out was why traffic_server
>> receives so many messages at boot-up. I added some debug logs and I found
>> that traffic_server kept getting 'RECG' events for the metric
>> 'proxy.process.ssl.total_success_handshake_count'. Narrowing it down
>> further (in int RecSetRecord(...)), from traffic_manager's perspective:
>>
>>   if (i_am_not_owner_of(stat)): send traffic_server a notify message
>>
>> Ideally, this code should not be triggered, because traffic_manager
>> controls updates to 'proxy.process.ssl.total_success_handshake_count'.
>> However, ownership in the code is determined solely from the string name.
>> Since the stat name contains 'process', the code assumes that it is owned
>> by traffic_server. This leads to an interesting scenario where
>> traffic_manager keeps updating its own stat and sends unnecessary events
>> to traffic_server. I am still not sure how and when these updates happen.
>
> In Evaluator::eval() we evaluate all the Lua derived metrics and update
> them using RecSetRecord(). RecSetRecord() will end up sending a RECG_SET
> message for this metric since it is not the owner. The traffic_manager
> event loop evaluates the metrics approximately every second (which is
> overkill since the per-thread metric caches are not sampled that quickly).

Thanks for explaining this! To check my own understanding, I have put a
small illustration of that path at the very end of this mail.

>
>> But if the updates happen frequently, then due to 'Problem 1' we can
>> prevent traffic_server from processing any messages for up to 10,000
>> seconds! (I am imagining the case where each message arrives just before
>> the 1-second timeout expires.)
>>
>> I see this situation very often on a virtual machine for some reason.
>> Most of the time, traffic_cop comes along and kills traffic_server
>> because it took too long to initialize.
>>
>> I verified that by removing this derived metric, the problem went away.
>> However, I have a feeling that we need to fix 'Problem 1' as well.
>>
>>>> I wanted to know if my understanding of ownership is correct. And if it
>>>> is, can I open a new bug to track the definition of this stat?
>>>
>>> Yes, it sounds like it is worth revisiting the SSL stats to make sure
>>> they still make sense. Last time I looked at these, there were a few
>>> problems.
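
To make the idea concrete, here is a small self-contained C++ model of that
polling loop. The names (PollLoopModel, Message, read_one and so on) are
stand-ins I made up for this mail, not the real ProcessManager /
pollLMConnection code, and the 'bounded' variant is only one possible shape
for a fix, not necessarily the patch I am testing:

    // A minimal, self-contained model of the traffic_server polling thread.
    // NOTE: PollLoopModel, Message and read_one are illustrative stand-ins,
    // not the real ProcessManager / pollLMConnection API.
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <optional>
    #include <string>

    struct Message {
      std::string payload;
    };

    class PollLoopModel {
    public:
      // read_one models mgmt_read_timeout() + read(): it blocks for up to ~1s
      // and returns a message if one arrived, or nothing on timeout.
      explicit PollLoopModel(std::function<std::optional<Message>()> read_one)
        : read_one_(std::move(read_one)) {}

      // Roughly the current behaviour: keep reading until a timeout or 10,000
      // messages, and only then let the caller drain the event queue. Under a
      // steady stream of messages this starves processEventQueue().
      void pollUnbounded() {
        for (int count = 0; count < 10000; ++count) {
          std::optional<Message> msg = read_one_();
          if (!msg) {
            break; // timeout: nothing arrived for ~1 second
          }
          event_queue_.push_back(*msg);
        }
      }

      // One possible mitigation (a sketch only): bound how many messages are
      // drained per call so the caller gets back to processEventQueue() quickly
      // even when messages arrive continuously.
      void pollBounded(int max_per_call = 32) {
        for (int count = 0; count < max_per_call; ++count) {
          std::optional<Message> msg = read_one_();
          if (!msg) {
            break;
          }
          event_queue_.push_back(*msg);
        }
      }

      // Stands in for processEventQueue(): handle everything queued so far,
      // which is what eventually signals the waiting main thread.
      void processEventQueue(const std::function<void(const Message &)> &handle) {
        while (!event_queue_.empty()) {
          handle(event_queue_.front());
          event_queue_.pop_front();
        }
      }

    private:
      std::function<std::optional<Message>()> read_one_;
      std::deque<Message> event_queue_;
    };

    int main() {
      int pending = 100000; // simulate a manager that never stops sending
      PollLoopModel loop([&]() -> std::optional<Message> {
        if (pending == 0) {
          return std::nullopt;
        }
        --pending;
        return Message{"RECG_SET proxy.process.ssl.total_success_handshake_count"};
      });

      // With the bounded drain, control returns after 32 messages and the
      // queued events can be processed (and the main thread signalled).
      loop.pollBounded();
      int processed = 0;
      loop.processEventQueue([&](const Message &) { ++processed; });
      std::printf("processed %d messages before going back to poll\n", processed);
      return 0;
    }

Another option would be for pollLMConnection() itself to process (or hand
off) queued events every N messages; either way, the point is that a
continuous stream of RECG_SET messages should not be able to delay the
signal that unblocks the main thread.
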
>>
>> Sure, let me open a bug for this. Even if it ends up being invalid, the
>> discussion can be helpful.
>>>
>>> J
>>>
>>
>> Thanks,
>> Shrihari
>>
>
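
As promised above, a tiny illustration of the ownership-by-name path,
mostly to check my own understanding. The function names here
(ownerFromName, managerSetRecord) are invented for the example; the real
code paths are Evaluator::eval() and RecSetRecord():

    // A tiny model of the name-based ownership check; all names here are made
    // up for the example, this is not the actual RecCore / Evaluator code.
    #include <cstdio>
    #include <string>

    enum class Owner { Manager, Server };

    // Ownership is inferred purely from the record name, which is the crux of
    // the problem: the derived metric lives under 'proxy.process.' even though
    // traffic_manager is the process that updates it.
    Owner ownerFromName(const std::string &name) {
      return name.rfind("proxy.process.", 0) == 0 ? Owner::Server : Owner::Manager;
    }

    // Models the manager-side record update: if the name says the record
    // belongs to the other process, a RECG_SET-style notification is sent.
    void managerSetRecord(const std::string &name, long value) {
      if (ownerFromName(name) != Owner::Manager) {
        std::printf("notify traffic_server: %s = %ld\n", name.c_str(), value);
      }
      // ...store the value locally either way...
    }

    int main() {
      // The derived-metric evaluator runs roughly once per second, so this
      // update (and with it the notification) repeats on every cycle.
      for (long tick = 0; tick < 3; ++tick) {
        managerSetRecord("proxy.process.ssl.total_success_handshake_count", tick);
      }
      return 0;
    }

Compiled and run, this prints a 'notify traffic_server' line on every tick,
which is essentially the once-per-second RECG_SET stream I was seeing at
boot-up.
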
