I think I prefer the simple option. Making the TR config update dependent on 
the cache update seems to be opening a whole other can of worms. 

Rgds,
JvD

> On Feb 2, 2017, at 5:55 AM, Nir Sopher <n...@qwilt.com> wrote:
> 
> Hi All,
> 
> This thread comes to give a wider view of the two different approaches on
> the table for the "management and operations sequences streamlining"
> discussion.
> 
> I would still greatly appreciate a high level discussion of the issue
> itself and the different approaches. I hope the below preliminary example
> algorithms would shed some more light on the differences between the
> approaches and help the community decide which is preferable.
> 
> Thank you all,
> Nir
> 
> 
> 
> ============================================================
> =============================================================
> *"Simple" traffic-ops orchestrated solution highlights*
> 
> In a "simple" solution, traffic ops follows the steps below when a delivery
> service's list of servers is modified:
> 
>   1. Queue the delivery-service configuration added to the traffic-servers:
>   E.g. Add the new remap rule to "remap.config" of each traffic-server
>   newly assigned to a delivery-service.
>   2. Wait for all [updated] servers to acknowledge that the new
>   configuration was pulled
>   3. Update traffic-router with the new delivery-service cr-config
>   4. Queue the delivery-service configuration removal from the
>   traffic-servers:
>   E.g. Remove the remap rule from the "remap.config" of each
>   traffic-server no longer assigned to a delivery-service.
>   5. Possibly wait for all [updated] servers to acknowledge that the
>   latest configuration was deployed, before allowing a new configuration
>   cycle.
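
A minimal sketch of the five steps above, in Python. It assumes an in-memory stand-in for the real components: `FakeCdn`, `queue_add`, `queue_remove`, `wait_for_ack` and `update_router` are hypothetical names, not Traffic Ops APIs, and the acknowledgement is instant here where a real system would poll.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCdn:
    """Hypothetical stand-in for traffic-ops, the traffic-servers and the router."""
    remap: dict = field(default_factory=dict)   # server -> set of delivery services
    router: dict = field(default_factory=dict)  # delivery service -> set of servers
    log: list = field(default_factory=list)     # ordered record of what was done

    def queue_add(self, server, ds):
        self.remap.setdefault(server, set()).add(ds)
        self.log.append(("add_remap", server))

    def queue_remove(self, server, ds):
        self.remap.get(server, set()).discard(ds)
        self.log.append(("remove_remap", server))

    def wait_for_ack(self, servers):
        # Assumption: a real implementation would block until every server
        # reports that it pulled the queued configuration.
        self.log.append(("ack", tuple(sorted(servers))))

    def update_router(self, ds, servers):
        self.router[ds] = set(servers)
        self.log.append(("router", tuple(sorted(servers))))


def reassign(cdn, ds, old_servers, new_servers):
    """Apply a server-list change in the add, route, remove order."""
    added = set(new_servers) - set(old_servers)
    removed = set(old_servers) - set(new_servers)
    for server in sorted(added):         # step 1: add remap rules first
        cdn.queue_add(server, ds)
    cdn.wait_for_ack(added)              # step 2: wait for the added servers
    cdn.update_router(ds, new_servers)   # step 3: only now update the router
    for server in sorted(removed):       # step 4: remove the old remap rules
        cdn.queue_remove(server, ds)
    cdn.wait_for_ack(removed)            # step 5: gate the next cycle


cdn = FakeCdn(remap={"ts1": {"ds-a"}, "ts2": {"ds-a"}},
              router={"ds-a": {"ts1", "ts2"}})
reassign(cdn, "ds-a", {"ts1", "ts2"}, {"ts2", "ts3"})
```

At no point in this ordering does the router point at a server that lacks the remap rule, which is the "black hole" the sequence is meant to avoid.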
> 
> 
> Same steps also hold for the "delivery service HOST_REGEXP change":
> #1 - Add the new remap rule to each assigned traffic-server's "remap.config"
> #4 - Remove the old remap rule from each assigned traffic-server's
> "remap.config"
> 
> Many more details are probably missing, but basically, this algorithm is
> relatively simple and clear.
> Additionally, as a first step, the operation may be done in "global"
> scope, and only later improved to work independently
> per delivery-service.
> Furthermore, most changes are likely to be limited to traffic-ops and
> isolated from other flows in the system. Being centralized may make the
> process more stable as well as easier to debug via proper log messages.
> 
> ============================================================
> ===========================================================
> *"Flexible" traffic-router based solution for delivery-service
> configuration deployment.*
> 
> Let's define a delivery-service configuration "generation". Such a
> "generation" would be an ordinal identifier for a delivery service
> configuration.
> A "generation" changes whenever a new configuration is applied that changes
> the remap rule at some of the servers, or the content to server assignment.
> Mainly:
> 
>   1. Adding the delivery service
>   2. Assigning new traffic servers to the existing delivery service
>   (changing the "consistent hash" assignment done by traffic router)
>   3. Removing the delivery service
>   4. Removing assigned traffic-servers from the delivery service.
>   5. More complicated scenarios to be discussed:
>      1. Moving a server between cache groups.
>      2. Changing the HOST_REGEXP of the delivery service.
> 
> Under this definition, the remap rules and crconfig.json will be
> conceptually broken into "per delivery service" segments. These segments
> can be managed independently, but this is not required in the first step.
> 
> At any given moment, each traffic-server holds a single generation of a
> "remap rule configuration" for each relevant delivery service.
> The traffic router, on the other hand, holds for each known HOST_REGEXP a
> stack of the relevant "delivery-service cr-config" segments, allowing it to
> maintain a short history.
> Furthermore, the traffic router knows which configuration generation was
> read by which traffic-server for each delivery service. This can be done
> using traffic-monitor via astats.
> 
> The main logic of this solution is implemented in the traffic-router, which
> has to take the "generation" into account when redirecting requests to a
> traffic-server.
> For example, when a new GET request reaches the traffic router, it can
> follow the algorithm below (optimizations are required):
> 
>   1. Identify the HOST_REGEXP and choose the "cr-config" stack
>   accordingly.
>   Point to the "top" of the stack.
>   2. Based on the "cr-config", choose the traffic-server to redirect to.
>   This is done exactly as it is done today, based on the delivery
>   service as well as the servers' health*.
>   3. If the chosen server has the proper configuration generation,
>   redirect to it (and we are done).
>   4. Otherwise, move to the next cr-config segment in the stack, and go to
>   step 2.
> 
> * A server holding a newer remap configuration generation for the delivery
> service (compared to the one pointed at in the traffic router stack) is
> considered "down" in the content to server assignment calculation.
> Otherwise, the algorithm might end up with no server to redirect to.
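
The stack walk above could look roughly like the following Python sketch. The data shapes are assumptions made for illustration: the stack is a list of `(generation, servers)` pairs with the newest first, `server_generation` is what the router learned about each server, and `choose` stands in for the existing consistent-hash selection. None of these names come from traffic-router itself.

```python
def pick_server(stack, server_generation, healthy, choose):
    """Walk the cr-config stack until the chosen server holds that generation.

    stack: list of (generation, servers), newest generation first.
    server_generation: server -> generation the server currently holds.
    healthy: set of servers currently reported healthy.
    choose: function(candidates) -> server (placeholder for consistent hash).
    """
    for generation, servers in stack:
        # Footnote rule: a server holding a newer generation than this
        # segment is treated as "down" for this segment's selection.
        candidates = [s for s in servers
                      if s in healthy
                      and server_generation.get(s, -1) <= generation]
        if not candidates:
            continue
        server = choose(candidates)
        if server_generation.get(server, -1) == generation:
            return server, generation   # step 3: redirect and stop
        # step 4: the chosen server lags behind, try the older segment
    return None


stack = [(2, ["ts1", "ts2", "ts3"]), (1, ["ts1", "ts2"])]
healthy = {"ts1", "ts2", "ts3"}
first = lambda candidates: sorted(candidates)[0]  # deterministic placeholder

lagging = pick_server(stack, {"ts1": 1, "ts2": 2, "ts3": 2}, healthy, first)
updated = pick_server(stack, {"ts1": 2, "ts2": 2, "ts3": 2}, healthy, first)
```

In the first call "ts1" still holds generation 1, so the walk falls back to the older segment and the request keeps going to "ts1"; once "ts1" reports generation 2, the top of the stack is used directly.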
> 
> The above algorithm tries to minimize changes to the system's behavior
> when no change is applied. It also tries to avoid instability / cache
> thrashing by limiting temporary "consistent hash" results during the
> transition.
> 
> In order to provide
> 
> On Thu, Feb 2, 2017 at 2:39 PM, Nir Sopher <n...@qwilt.com> wrote:
> 
>> Hi Eric,
>> Actually, as we imagined it, a "generation" is created only when a new
>> configuration is applied - when the "consistent hash" is permanently
>> modified.
>> 
>> I'll open a separate thread to discuss the technical details further,
>> including an algorithm we have in mind.
>> 
>> I also opened TC-130 - Streamlining TC management and operations sequences
>> <https://issues.apache.org/jira/browse/TC-130> to further monitor the
>> issue.
>> 
>> Would appreciate community input on the issue, especially on
>> the PROs and CONs of the two approaches:
>> a Traffic Ops orchestrated solution vs. a more flexible, traffic-router
>> algorithm-based solution.
>> 
>> Nir
>> 
>> 
>> 
>> 
>> On Wed, Feb 1, 2017 at 3:33 PM, Eric Friedrich (efriedri) <
>> efrie...@cisco.com> wrote:
>> 
>>> Hey Nir-
>>>  Interesting thought for sure.
>>> 
>>> Would TM “health changes” (loss of connectivity, BW/loadavg too high)
>>> change the generation count? It seems like the answer is Yes, because the
>>> health of a cache impacts the state of the consistent hash ring.
>>> 
>>> If so, how do these generation changes get from the Traffic Monitor to
>>> the caches, when config changes typically come only from Traffic Ops and
>>> only when ORT is run?
>>> 
>>> Or maybe the generation count is just an abstraction to conceptualize the
>>> problem space and not a literal approach?
>>> 
>>> —Eric
>>> 
>>>> On Feb 1, 2017, at 4:14 AM, Nir Sopher <n...@qwilt.com> wrote:
>>>> 
>>>> Hi Eric,
>>>> 
>>>> Formalizing the approach you suggested, one may introduce the concept of a
>>>> delivery-service configuration "generation" which would be an ordinal
>>>> identifier for a delivery service configuration. A "generation" changes
>>>> whenever the remap rule changes or the consistent hash mapping of content
>>>> to server changes (e.g. due to additional server assignment).
>>>> In such a solution, each traffic-server may hold a single generation for
>>>> each delivery service configuration, while traffic-router may hold a
>>>> history of generations and know which server holds which configuration
>>>> generation.
>>>> 
>>>> This approach introduces considerable flexibility. It allows
>>>> configurations to be set one after the other with no need to wait between
>>>> them.
>>>> It also fits well with Jeremy's suggestion for queue-update with a delivery
>>>> service granularity.
>>>> 
>>>> On the other hand, complicated algorithms for solving the issue may impose
>>>> more risk to the network when applied, compared to a simple "traffic-ops"
>>>> orchestrated solution.
>>>> 
>>>> I'm not sure what is preferable from an operator point of view. I'm also
>>>> not familiar with the TC 3.0 configuration solution to validate the different
>>>> approaches against.
>>>> 
>>>> Please share your thoughts,
>>>> Thanks,
>>>> Nir
>>>> 
>>>> On Tue, Jan 31, 2017 at 6:26 PM, Eric Friedrich (efriedri) <
>>>> efrie...@cisco.com> wrote:
>>>> 
>>>>> What about an approach (apologies, still light on details), where TR
>>>>> (perhaps still via TM) discovers the availability of delivery services from
>>>>> the cache itself, rather than from the CRConfig file? (Astats or its
>>>>> remap_stats based replacement would publish its remap rules)
>>>>> 
>>>>> Any changes to the set of servers (add/remove) or DS assignments would not
>>>>> require a specific step to push a changed config to the router. If a cache
>>>>> does not yet, or no longer has remap rules for a specific delivery service,
>>>>> then TR will not see that rule advertised by the cache and will not send it
>>>>> traffic. If adding or removing a server, TM still needs to be updated to
>>>>> learn about the new server.
>>>>> 
>>>>> With the current configuration, there's a race condition of a few seconds where
>>>>> a cache removes a remap rule before TM polls and TR gets health info from TM.
>>>>> In these few seconds, TR would erroneously send traffic to a cache without
>>>>> a proper remap rule.
>>>>> 
>>>>> We could fix this by
>>>>> a) advertising a state of the remap rule in astats to notify TR no
>>>>> longer to send traffic on that DS for a short period before the rule is
>>>>> actually removed (all handled inside of ORT),
>>>>>   or
>>>>> b) prematurely removing the remap rule from astats, before the config on
>>>>> TS is actually updated (at the cost of missing the final few remap stats
>>>>> numbers). This is probably unacceptable.
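
As a rough illustration of option (a), here is a Python sketch in which the router only considers caches that advertise an active remap rule for the delivery service. The `"active"` / `"draining"` state names and the `advertised` mapping are assumptions for this example; the thread does not say how astats would actually expose such a state.

```python
ACTIVE, DRAINING = "active", "draining"  # hypothetical rule states


def eligible_caches(delivery_service, advertised, healthy):
    """Return caches that are healthy and advertise an active remap rule.

    advertised: cache -> {delivery service -> state}, as the cache itself
    reports it (assumed to arrive via TM polling).
    healthy: set of caches TM currently considers healthy.
    """
    return sorted(cache for cache, rules in advertised.items()
                  if cache in healthy
                  and rules.get(delivery_service) == ACTIVE)


advertised = {
    "ts1": {"ds-a": ACTIVE},
    "ts2": {"ds-a": DRAINING},  # rule about to be removed: stop sending traffic
    "ts3": {},                  # rule not yet (or no longer) present
    "ts4": {"ds-a": ACTIVE},    # has the rule but is not healthy
}
targets = eligible_caches("ds-a", advertised, {"ts1", "ts2", "ts3"})
```

A cache in the draining state still serves requests already in flight, but the router stops choosing it before the rule disappears, which closes the few-second window described above.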
>>>>> 
>>>>> I’m sure there are other variants on this, but my main goal is for TR to
>>>>> directly learn from the caches which delivery services they actually have
>>>>> available, rather than learning what TO only thinks each cache has
>>>>> available.
>>>>> 
>>>>> —Eric
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jan 31, 2017, at 8:10 AM, Nir Sopher <n...@qwilt.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> In order to further improve the simplicity and robustness of the control
>>>>>> path for provisioning infrastructure and delivery services, we are
>>>>>> currently considering ways to streamline management and operations.
>>>>>> 
>>>>>> Currently, when applying changes in traffic-control that require
>>>>>> synchronization between the traffic-router and traffic-servers, the user
>>>>>> must take care to apply them in a certain order. Otherwise, "black holes"
>>>>>> may be created. Furthermore, in some scenarios the user has to wait
>>>>>> and verify that the configuration reached all traffic servers before
>>>>>> applying it to the traffic-router.
>>>>>> 
>>>>>> We have noticed that TC-3.0 is planned to include a "Config State Machine",
>>>>>> probably dealing with the issue thoroughly. We have no further information
>>>>>> about this bullet and would appreciate any additional info.
>>>>>> 
>>>>>> We would like to start investing in making TC operations more streamlined,
>>>>>> robust and user-friendly.
>>>>>> 
>>>>>> The main use-cases we would like to address at this point are:
>>>>>> 
>>>>>> 1. Assign servers to a Delivery-Service.
>>>>>> For this operation, the configuration must first be applied to the added
>>>>>> traffic servers, propagate, and only then be applied to the traffic-router.
>>>>>> 2. Remove server assignments from a Delivery-Service.
>>>>>> For this operation, the configuration must first be applied to the
>>>>>> traffic-router, and only then to the traffic-servers.
>>>>>> 3. Add a new delivery service.
>>>>>> This is practically a special case of assigning servers to a
>>>>>> delivery-service.
>>>>>> 4. Delete a delivery service.
>>>>>> This is practically a special case of removing server assignments from a
>>>>>> delivery-service.
>>>>>> 5. Update settings that must be applied together on the traffic servers
>>>>>> and the router.
>>>>>> 
>>>>>> We would like to simplify the procedure, allowing the deployment of a new
>>>>>> configuration in a single operation, instead of doing it step by step.
>>>>>> 
>>>>>> One solution can be based on the insight that deploying such configuration
>>>>>> changes may be done by initially updating the traffic servers with added
>>>>>> functionality (e.g. a remap rule), then updating the router, and lastly,
>>>>>> removing old functionality from the traffic servers. Such a solution can be
>>>>>> orchestrated by traffic-ops, probably without complicating other components.
>>>>>> 
>>>>>> Other solutions may provide more flexibility, but would probably involve
>>>>>> adding complexity to other components such as traffic-router.
>>>>>> 
>>>>>> We would be glad to hear the community's thoughts on the matter, so we can
>>>>>> take this further.
>>>>>> 
>>>>>> Thanks,
>>>>>> Nir
>>>>> 
>>>>> 
>>> 
>>> 
>> 
