[jira] [Work logged] (TS-4735) Possible deadlock on traffic_server startup

ASF GitHub Bot (JIRA) Thu, 01 Sep 2016 17:29:14 -0700

     [ 
https://issues.apache.org/jira/browse/TS-4735?focusedWorklogId=27857&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-27857
 ]


ASF GitHub Bot logged work on TS-4735:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Sep/16 00:27
            Start Date: 02/Sep/16 00:27
    Worklog Time Spent: 10m 
      Work Description: Github user jpeach commented on the issue:

    https://github.com/apache/trafficserver/pull/872
  
    @kshri23 I dug into the startup sequence a bit more and I'm now convinced 
that this is a reasonable approach. What do you think about just changing 
``MAX_MSGS_IN_A_ROW`` to something sanely small (like 10)?
    
    @kshri23 You need to run ``make -j clang-format``.
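
The suggestion above to cap ``MAX_MSGS_IN_A_ROW`` at a small value can be sketched as follows. This is a minimal, self-contained illustration and not the Traffic Server code: `FakeConnection`, `read_one`, and `poll_bounded` are hypothetical names standing in for the management socket, `mgmt_read_timeout()` plus `read()`, and `pollLMConnection()`. The point is only that a small cap forces the poll function to return quickly so the caller can process what was queued.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for the management connection: each entry is one
// pending message from traffic_manager.
struct FakeConnection {
  std::deque<int> pending;

  // Stand-in for mgmt_read_timeout() + read(): returns false when nothing
  // is available (the real call would time out after 1 second).
  bool read_one(int &msg) {
    if (pending.empty()) {
      return false;
    }
    msg = pending.front();
    pending.pop_front();
    return true;
  }
};

// Reads at most max_in_a_row messages, queues them, and returns the number
// read, so the caller gets a chance to process the queue between polls.
int poll_bounded(FakeConnection &conn, std::deque<int> &event_queue, int max_in_a_row) {
  assert(max_in_a_row > 0);
  int count = 0;
  int msg = 0;
  while (count < max_in_a_row && conn.read_one(msg)) {
    event_queue.push_back(msg);
    ++count;
  }
  return count;
}
```

With 25 messages pending and a cap of 10, three successive calls read 10, 10, and 5 messages, and the caller regains control after each call instead of only once the stream goes quiet.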


Issue Time Tracking
-------------------

    Worklog Id:     (was: 27857)
    Time Spent: 1h 20m  (was: 1h 10m)

> Possible deadlock on traffic_server startup
> -------------------------------------------
>
>                 Key: TS-4735
>                 URL: https://issues.apache.org/jira/browse/TS-4735
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 6.2.0
>            Reporter: Shrihari
>            Assignee: Shrihari
>             Fix For: 7.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> As part of startup, traffic_server creates two threads (to begin with).
> 1. The main thread (Thread 1) blocks until it is signaled by another thread.
> 2. Thread 2 polls for messages from traffic_manager.
> Thread 1 is waiting for a message from traffic_manager that contains all the 
> configuration required for it to go ahead with initialization. Hence, it is 
> critical that the main Thread (1) wait until it gets the configuration.
> Thread 2, which polls for messages from traffic_manager, works like this:
> {noformat}
> for (;;) {
>   if (pmgmt->require_lm) {        // always true (when using traffic_cop)
>     pmgmt->pollLMConnection();
>     // pollLMConnection() works roughly like this:
>     //   for (count = 0; count < 10000; count++) {
>     //     num = mgmt_read_timeout(...);  // blocking call; returns 0 if
>     //                                    // nothing was received for 1 second
>     //     if (!num) break;               // leave the loop and return
>     //     else: read(fd), add_to_event_queue, continue the loop
>     //           (back to fetching another message)
>     //   }
>   }
>   pmgmt->processEventQueue();     // process the messages received in
>                                   // pollLMConnection()
>   pmgmt->processSignalQueue();
>   mgmt_sleep_sec(pmgmt->timeout);
> }
> {noformat}
> RCA:
> There are two problems here:
> 1. Looking at the code above, pollLMConnection() might not return for a very 
> long time if it keeps receiving messages. As a result, we may not call 
> processEventQueue(), which processes the received messages. And unless we 
> process the messages, we cannot signal the main Thread (1), which is still 
> blocked, to continue. Hence we see the issue where traffic_server won't 
> complete initialization for a very long time.
> 2. The second problem is why traffic_server receives so many messages at 
> boot-up. The problem lies in the configuration. In 6.2.x, we replaced 
> 'proxy.process.ssl.total_success_handshake_count' with 
> 'proxy.process.ssl.total_success_handshake_count_in'. 
> In order to provide backwards compatibility, we defined the old stat in 
> stats.config.xml. The caveat here is that, since this statconfig is defined 
> in stats.config.xml, traffic_manager assumes the responsibility of updating 
> this stat. According to the code:
> {noformat}
> if (i_am_not_owner_of(stat)) : send traffic_server a notify message.
> {noformat}
> Ideally, this code should not be triggered because traffic_manager does own 
> the stat. However, ownership in the code is determined solely by the 
> 'string name'. If the name contains 'process', it is owned by 
> traffic_server. This leads to an interesting scenario where traffic_manager 
> keeps updating its own stat and sends unnecessary events to traffic_server. 
> These updates happen every 1 second (Thanks James for helping me understand 
> this period) which is the same as our timeout in traffic_server.  Due to 
> 'Problem 1' we can prevent traffic_server from processing any messages for up 
> to 10,000 seconds! (Just imagine the case where each message is received just 
> before the 1-second timeout expires.)
> I saw this happen 100% of the time on a VM but 0% of the time on a physical 
> box. I don't have any other results as of now though.
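
The name-based ownership rule described in the issue can be sketched as follows. This is an illustration under stated assumptions, not the actual Traffic Server implementation: the helper names `owned_by_traffic_server` and `manager_sends_notify` are hypothetical, and the only logic taken from the description is that a record whose name contains 'process' is treated as owned by traffic_server.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the rule in the description: ownership is
// decided purely from the record name, so any name containing "process" is
// treated as belonging to traffic_server.
bool owned_by_traffic_server(const std::string &name) {
  return name.find("process") != std::string::npos;
}

// Hypothetical helper: traffic_manager sends a notify message for every
// record it believes it does not own, even if it is in fact the one
// updating that record (as with a stat defined in stats.config.xml).
bool manager_sends_notify(const std::string &name) {
  return owned_by_traffic_server(name);
}
```

Under this rule the backwards-compatibility stat 'proxy.process.ssl.total_success_handshake_count' triggers a notify on every update, because its name contains 'process' even though traffic_manager maintains it.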



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
