On 7/8/16, 9:19 AM, "James Peach" <[email protected]> wrote:
>
>> On Jul 7, 2016, at 1:16 PM, Shrihari Kalkar <[email protected]> wrote:
>>
>> James,
>> Thank you for the clarifications. Please find my responses inline.
>>
>> On 7/7/16, 11:56 AM, "James Peach" <[email protected]> wrote:
>>
>>>
>>>> On Jul 7, 2016, at 9:55 AM, Shrihari Kalkar <[email protected]> wrote:
>>>>
>>>> Guys,
>>>> I understand that in code, if a stat is registered under
>>>> 'proxy.process', it is owned by traffic_server.
>>>
>>> That is by convention. In the code, RECT_PROCESS metrics are owned by
>>> traffic_server. Ownership of metrics is not a particularly strong
>>> concept though :-/
>>
>> Cool!
>>
>>>> Also, if a stat is registered in stats.config.xml or metrics.config,
>>>> the 'destination' is owned by traffic_manager.
>>>>
>>>> Since proxy.process.ssl.total_success_handshake_count is listed in
>>>> stats.config.xml and metrics.config, do you think that this is a
>>>> conflict of ownership?
>>>
>>> Grovelling through the git history will probably clear this up for you.
>>> IIRC total_success_handshake_count was moved to the derived metrics for
>>> compatibility when total_success_handshake_count_in and
>>> total_success_handshake_count_out were introduced. Maybe it makes sense
>>> now to make total_success_handshake_count the sum of in and out?
>>
>> Yes, I went through https://issues.apache.org/jira/browse/TS-3409 to
>> understand why this was added. I agree that it should be a derived metric,
>> but I found it odd that it was the only derived metric that had a
>> destination set to 'proxy.process'.
>>
>>>> I see that on my instance, traffic_manager keeps sending an 'RECG_SET'
>>>> event for this stat to traffic_server even when nothing is happening. As
>>>> a result, I am seeing an issue where traffic_server blocks for a very
>>>> long time during initialization.
>>>
>>> Hmm. I'm not sure that slow initialization would be related to this. One
>>> common cause of initialization lag is setting
>>> proxy.config.http.wait_for_cache. I think you need to debug this one
>>> further :)
>>
>> I did some analysis there and here is what I found out. Hopefully I am not
>> too far off on the wrong path.
>>
>> As part of startup, traffic_server creates two threads (to begin with):
>> 1. The main thread (Thread 1) blocks until it is signaled by another thread.
>
> Where do you see this happening?

I see this happening in the following code path:
Main() -> initialize_process_manager() -> RecProcessInitMessage():

    if (mode_type == RECM_CLIENT) {
      send_pull_message(RECG_PULL_REQ);
      g_force_req_notify.lock();
      g_force_req_notify.wait();      <--- waits for a signal from another thread
      g_force_req_notify.unlock();
    }

>
>> 2. Thread 2 polls for messages from traffic_manager.
>>
>> traffic_server is waiting for a message from traffic_manager which
>> contains all the configuration required for it to go ahead with
>> initialization. Hence, I understand that it is critical that the main
>> thread (Thread 1) waits until it gets the configuration.
>>
>> Thread 2, which polls for messages from traffic_manager, works like this:
>>
>>   for (;;) {
>>     if (pmgmt->require_lm) {        <--- always true (for us)
>>       pmgmt->pollLMConnection();    <--- for (count = 0; count < 10000; count++)
>>                                             num = mgmt_read_timeout(...)  <--- blocking call; returns 0 if
>>                                                                                nothing was received for 1 second
>>                                             if !num: break                <--- break out of the loop and
>>                                                                                return from the function
>>                                             else: read(fd), add_to_event_queue,
>>                                                   continue the loop       <--- back to fetching another message
>>     }
>>     pmgmt->processEventQueue();     <--- process the messages received in pollLMConnection()
>>     pmgmt->processSignalQueue();
>>     mgmt_sleep_sec(pmgmt->timeout);
>>   }
>>
>> I found two possible problems here:
>>
>> 1. When I looked into the above code, I observed that pollLMConnection
>> might not return for a very long time if it keeps getting messages. As a
>> result, we may not call processEventQueue() (until we receive 10k
>> messages), which is what processes the received messages. And unless we
>> process the messages, we cannot signal the main thread (Thread 1), which
>> is still blocked, to continue. Hence I saw the issue where traffic_server
>> won't complete initialization for a very long time.
>
> Yes, this could definitely be improved. I'd be happy to help shepherd a
> patch if this is something you would like to work on.

Sure, I have a patch in mind that I am testing. Let me do some more testing
on it and then I will reach out to you to understand the patch process. (A
rough sketch of the idea is further down in this mail, after the quoted
discussion.)

>
>> 2. The second problem I wanted to figure out was why traffic_server
>> receives so many messages at boot-up. I added some debug logs and I found
>> that traffic_server kept getting 'RECG' events for the metric
>> 'proxy.process.ssl.total_success_handshake_count'. Narrowing it down
>> further (in int RecSetRecord(...)), from traffic_manager's perspective:
>>
>>   if (i_am_not_owner_of(stat)): send traffic_server a notify message
>>
>> Ideally, this code should not be triggered, because traffic_manager
>> controls updates to 'proxy.process.ssl.total_success_handshake_count'.
>> However, ownership in the code is determined solely from the string name.
>> Since the stat name contains 'process', the code assumes that it is owned
>> by traffic_server. This leads to an interesting scenario where
>> traffic_manager keeps updating its own stat and sends unnecessary events
>> to traffic_server. I am still not sure how and when these updates happen.
>
> In Evaluator::eval() we evaluate all the Lua derived metrics and update
> them using RecSetRecord(). RecSetRecord() will end up sending a RECG_SET
> message for this metric since it is not the owner. The traffic_manager
> event loop evaluates the metrics approximately every second (which is
> overkill since the per-thread metric caches are not sampled that quickly).

Thanks for explaining this! To check my own understanding, I have put a
small illustration of that path at the very end of this mail.

>
>> But if the updates happen frequently, then due to 'Problem 1' we can
>> prevent traffic_server from processing any messages for up to 10,000
>> seconds! (I am imagining the case where each message arrives just before
>> the 1-second timeout expires.)
>>
>> I see this situation very often on a virtual machine for some reason.
>> Most of the time, traffic_cop comes along and kills traffic_server
>> because it took too long to initialize.
>>
>> I verified that by removing this derived metric, the problem went away.
>> However, I have a feeling that we need to fix 'Problem 1' as well.
>>
>>>> I wanted to know if my understanding of ownership is correct. And if it
>>>> is, can I open a new bug to track the definition of this stat?
>>>
>>> Yes, it sounds like it is worth revisiting the SSL stats to make sure
>>> they still make sense. Last time I looked at these, there were a few
>>> problems.
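
To make the idea concrete, here is a small self-contained C++ model of that
polling loop. The names (PollLoopModel, Message, read_one and so on) are
stand-ins I made up for this mail, not the real ProcessManager /
pollLMConnection code, and the 'bounded' variant is only one possible shape
for a fix, not necessarily the patch I am testing:

    // A minimal, self-contained model of the traffic_server polling thread.
    // NOTE: PollLoopModel, Message and read_one are illustrative stand-ins,
    // not the real ProcessManager / pollLMConnection API.
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <optional>
    #include <string>

    struct Message {
      std::string payload;
    };

    class PollLoopModel {
    public:
      // read_one models mgmt_read_timeout() + read(): it blocks for up to ~1s
      // and returns a message if one arrived, or nothing on timeout.
      explicit PollLoopModel(std::function<std::optional<Message>()> read_one)
        : read_one_(std::move(read_one)) {}

      // Roughly the current behaviour: keep reading until a timeout or 10,000
      // messages, and only then let the caller drain the event queue. Under a
      // steady stream of messages this starves processEventQueue().
      void pollUnbounded() {
        for (int count = 0; count < 10000; ++count) {
          std::optional<Message> msg = read_one_();
          if (!msg) {
            break; // timeout: nothing arrived for ~1 second
          }
          event_queue_.push_back(*msg);
        }
      }

      // One possible mitigation (a sketch only): bound how many messages are
      // drained per call so the caller gets back to processEventQueue() quickly
      // even when messages arrive continuously.
      void pollBounded(int max_per_call = 32) {
        for (int count = 0; count < max_per_call; ++count) {
          std::optional<Message> msg = read_one_();
          if (!msg) {
            break;
          }
          event_queue_.push_back(*msg);
        }
      }

      // Stands in for processEventQueue(): handle everything queued so far,
      // which is what eventually signals the waiting main thread.
      void processEventQueue(const std::function<void(const Message &)> &handle) {
        while (!event_queue_.empty()) {
          handle(event_queue_.front());
          event_queue_.pop_front();
        }
      }

    private:
      std::function<std::optional<Message>()> read_one_;
      std::deque<Message> event_queue_;
    };

    int main() {
      int pending = 100000; // simulate a manager that never stops sending
      PollLoopModel loop([&]() -> std::optional<Message> {
        if (pending == 0) {
          return std::nullopt;
        }
        --pending;
        return Message{"RECG_SET proxy.process.ssl.total_success_handshake_count"};
      });

      // With the bounded drain, control returns after 32 messages and the
      // queued events can be processed (and the main thread signalled).
      loop.pollBounded();
      int processed = 0;
      loop.processEventQueue([&](const Message &) { ++processed; });
      std::printf("processed %d messages before going back to poll\n", processed);
      return 0;
    }

Another option would be for pollLMConnection() itself to process (or hand
off) queued events every N messages; either way, the point is that a
continuous stream of RECG_SET messages should not be able to delay the
signal that unblocks the main thread.
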
>>
>> Sure, let me open a bug for this. Even if it ends up being invalid, the
>> discussion can be helpful.
>>>
>>> J
>>>
>>
>> Thanks,
>> Shrihari
>>
>
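
As promised above, a tiny illustration of the ownership-by-name path,
mostly to check my own understanding. The function names here
(ownerFromName, managerSetRecord) are invented for the example; the real
code paths are Evaluator::eval() and RecSetRecord():

    // A tiny model of the name-based ownership check; all names here are made
    // up for the example, this is not the actual RecCore / Evaluator code.
    #include <cstdio>
    #include <string>

    enum class Owner { Manager, Server };

    // Ownership is inferred purely from the record name, which is the crux of
    // the problem: the derived metric lives under 'proxy.process.' even though
    // traffic_manager is the process that updates it.
    Owner ownerFromName(const std::string &name) {
      return name.rfind("proxy.process.", 0) == 0 ? Owner::Server : Owner::Manager;
    }

    // Models the manager-side record update: if the name says the record
    // belongs to the other process, a RECG_SET-style notification is sent.
    void managerSetRecord(const std::string &name, long value) {
      if (ownerFromName(name) != Owner::Manager) {
        std::printf("notify traffic_server: %s = %ld\n", name.c_str(), value);
      }
      // ...store the value locally either way...
    }

    int main() {
      // The derived-metric evaluator runs roughly once per second, so this
      // update (and with it the notification) repeats on every cycle.
      for (long tick = 0; tick < 3; ++tick) {
        managerSetRecord("proxy.process.ssl.total_success_handshake_count", tick);
      }
      return 0;
    }

Compiled and run, this prints a 'notify traffic_server' line on every tick,
which is essentially the once-per-second RECG_SET stream I was seeing at
boot-up.
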
