I think I prefer the simple option. Making the TR config update dependent on the cache update seems to be opening a whole other can of worms.
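
To make concrete what the "flexible" option quoted below would ask of TR on each request, here is a rough Python sketch of the generation-stack lookup. It is only an illustration under assumptions: the names (`Segment`, `pick_server`), the data shapes, and the CRC-modulo stand-in for consistent hashing are all hypothetical, and none of this is Traffic Router code.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical per-delivery-service cr-config segment."""
    generation: int
    servers: tuple  # caches assigned to the delivery service in this generation


def pick_server(stack, cache_generation, healthy, path):
    """Walk the stack from newest to oldest; return (server, generation) or None.

    stack: list of Segment, newest first (assumed already selected by host).
    cache_generation: dict of server -> generation that cache last reported.
    healthy: set of servers currently considered healthy.
    """
    for segment in stack:
        # Unhealthy caches, and caches already on a NEWER generation than this
        # segment, are treated as "down" for this segment.
        candidates = [
            s for s in segment.servers
            if s in healthy and cache_generation.get(s, -1) <= segment.generation
        ]
        if not candidates:
            continue
        # Stand-in for consistent hashing: stable hash modulo the candidate list.
        chosen = candidates[zlib.crc32(path.encode()) % len(candidates)]
        if cache_generation.get(chosen, -1) == segment.generation:
            return chosen, segment.generation
        # The chosen cache has not pulled this generation yet: try an older one.
    return None
```

Even in this toy form, every redirect depends on per-cache generation state that TR would have to learn from the caches, which is the coupling I would rather avoid.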
Rgds, JvD

> On Feb 2, 2017, at 5:55 AM, Nir Sopher <n...@qwilt.com> wrote:
>
> Hi All,
>
> This thread comes to give a wider view of the two different approaches on
> the table for the "management and operations sequences streamlining"
> discussion.
>
> I would still greatly appreciate a high level discussion of the issue
> itself and the different approaches. I hope the below preliminary example
> algorithms would shed some more light on the differences between the
> approaches and help the community decide which is preferable.
>
> Thank you all,
> Nir
>
> ============================================================
> *"Simple" traffic-ops orchestrated solution highlights*
>
> In a "simple" solution traffic ops follows the below steps when a delivery
> service's list of servers is modified:
>
> 1. Queue the delivery-service configuration added to the traffic-servers:
>    E.g. Add the new remap rule to "remap.config" of each traffic-server
>    newly assigned to a delivery-service.
> 2. Wait for all [updated] servers to acknowledge that the new
>    configuration was pulled.
> 3. Update traffic-router with the new delivery-service cr-config.
> 4. Queue the delivery-service configuration removal from the
>    traffic-servers:
>    E.g. Remove the remap rule from the "remap.config" of each
>    traffic-server no longer assigned to a delivery-service.
> 5. Possibly wait for all [updated] servers to acknowledge that the
>    latest configuration was deployed, before allowing a new configuration
>    cycle.
>
> The same steps also hold for the "delivery service HOST_REGEXP change":
> #1 - Add the new remap rule to each assigned traffic-server's "remap.config"
> #4 - Remove the old remap rule from each assigned traffic-server's
>      "remap.config"
>
> Many more details are probably missing, but basically, this algorithm is
> relatively simple and clear.
> Additionally, in the first step, the operation may be done in "global"
> scope, and only later improved to work independently per delivery-service.
> Furthermore, most changes are likely to be limited to traffic-ops and
> isolated from other flows in the system. Being centralized may make the
> process more stable as well as easy to debug via proper log messages.
>
> ============================================================
> *"Flexible" traffic-router based solution for delivery-service
> configuration deployment.*
>
> Let's define a delivery-service configuration "generation". Such a
> "generation" would be an ordinal identifier for a delivery service
> configuration.
> A "generation" changes whenever a new configuration is applied that changes
> the remap rule at some of the servers, or the content to server assignment.
> Mainly:
>
> 1. Adding the delivery service
> 2. Assigning new traffic servers to the existing delivery service
>    (changing the "consistent hash" assignment done by traffic router)
> 3. Removing the delivery service
> 4. Removing assigned traffic-servers from the delivery service.
> 5. More complicated scenarios to be discussed:
>    1. Moving a server between cache groups.
>    2. Changing the HOST_REGEXP of the delivery service.
>
> Under this definition, the remap rules and crconfig.json will be
> conceptually broken into "per delivery service segments". These segments
> can be managed independently but it is not required in the first step.
>
> At any given moment, each traffic-server holds a single generation of a
> "remap rule configuration", for each relevant delivery service.
> The traffic router on the other hand, holds for each known HOST_REGEXP, a
> stack of the relevant "delivery-service cr-config" segments, allowing it to
> maintain a short history.
> Furthermore, the traffic router knows which configuration generation was
> read by which traffic-server for each delivery service. This can be done
> using traffic-monitor via astats.
>
> The main logic of this solution is implemented in the traffic-router, which
> has to apply some algorithm when redirecting requests to a
> traffic-server, taking the "generation" into account.
> For example, when a new GET request reaches the traffic router, it can
> follow the below algorithm (optimizations are required):
>
> 1. Identify the HOST_REGEXP and choose the "cr-config" stack
>    accordingly.
>    Point to the "top" of the stack.
> 2. Based on the "cr-config", choose the traffic-server to redirect to.
>    This is done exactly as it is done today, based on the delivery
>    service as well as servers' health*.
> 3. If the chosen server has the proper configuration generation,
>    redirect to it (and we are done).
> 4. Otherwise, move to the next cr-config segment in the stack, and go to
>    "2".
>
> * A server holding a newer remap configuration generation for the delivery
> service (compared to the one pointed at in the traffic router stack) is
> considered "down" in the content to server assignment calculation.
> Otherwise, the algorithm might end up with no server to redirect to.
>
> The above algorithm tries to minimize the changes on the system behavior,
> when no change is applied. It also tries to avoid instability / cache
> thrashing, by limiting temporary "consistent hash" results during the
> transition.
>
> In order to provide
>
> On Thu, Feb 2, 2017 at 2:39 PM, Nir Sopher <n...@qwilt.com> wrote:
>
>> Hi Eric,
>> Actually, as we imagined it, a "generation" is created only when a new
>> configuration is applied - when the "consistent hash" is permanently
>> modified.
>>
>> I'll open a separate thread to discuss the technical details further,
>> including an algorithm we have in mind.
>>
>> I also opened TC-130 - Streamlining TC management and operations sequences
>> <https://issues.apache.org/jira/browse/TC-130> to further monitor the
>> issue.
>>
>> Would appreciate community inputs about the issue, especially discussing
>> the PROs and CONs of the 2 different approaches:
>> Traffic Ops orchestrated solution vs. a more flexible, traffic-router
>> algorithm based solution.
>>
>> Nir
>>
>> On Wed, Feb 1, 2017 at 3:33 PM, Eric Friedrich (efriedri) <
>> efrie...@cisco.com> wrote:
>>
>>> Hey Nir-
>>> Interesting thought for sure.
>>>
>>> Would TM “health changes” (loss of connectivity, BW/loadavg too high)
>>> change the generation count? It seems like the answer is Yes, because the
>>> health of a cache impacts the state of the consistent hash ring.
>>>
>>> If so, how do these generation changes get from the Traffic Monitor to
>>> the caches, when config changes typically come only from Traffic Ops and
>>> only when ORT is run?
>>>
>>> Or maybe the generation count is just an abstraction to conceptualize the
>>> problem space and not a literal approach?
>>>
>>> —Eric
>>>
>>>> On Feb 1, 2017, at 4:14 AM, Nir Sopher <n...@qwilt.com> wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>> Formalizing the approach you suggested, one may introduce the concept
>>>> of a delivery-service configuration "generation" which would be an
>>>> ordinal identifier for a delivery service configuration. A "generation"
>>>> changes whenever the remap rule changes or the consistent hash mapping
>>>> of content to server changes (e.g. due to additional server assignment).
>>>> In such a solution, each traffic-server may hold a single generation for
>>>> each delivery service configuration, while traffic-router may hold a
>>>> history of generations and know which server holds which configuration
>>>> generation.
>>>>
>>>> This approach introduces a considerable flexibility.
>>>> It allows configurations to be set one after the other with no need to
>>>> wait between them.
>>>> It also fits well with Jeremy's suggestion for queue-update with a
>>>> delivery service granularity.
>>>>
>>>> On the other hand, complicated algorithms for solving the issue may
>>>> impose more risk to the network when applied, compared to a simple
>>>> "traffic-ops" orchestrated solution.
>>>>
>>>> I'm not sure what is preferable from an operator point of view. I'm also
>>>> not familiar with TC 3.0 configuration solution to validate the different
>>>> approaches against.
>>>>
>>>> Please share your thoughts,
>>>> Thanks,
>>>> Nir
>>>>
>>>> On Tue, Jan 31, 2017 at 6:26 PM, Eric Friedrich (efriedri) <
>>>> efrie...@cisco.com> wrote:
>>>>
>>>>> What about an approach (apologies, still light on details), where TR
>>>>> (perhaps still via TM) discovers the availability of delivery services
>>>>> from the cache itself, rather than from the CRConfig file? (Astats or
>>>>> its remap_stats based replacement would publish its remap rules)
>>>>>
>>>>> Any changes to the set of servers (add/remove) or DS assignments would
>>>>> not require a specific step to push a changed config to the router. If
>>>>> a cache does not yet, or no longer has remap rules for a specific
>>>>> delivery service, then TR will not see that rule advertised by the
>>>>> cache and will not send it traffic. If adding or removing a server, TM
>>>>> still needs to be updated to learn about the new server.
>>>>>
>>>>> With current configuration, there's a race condition of a few seconds
>>>>> where a cache removes the remap rule before TM polls and TR gets health
>>>>> info from TM. In these few seconds, TR would erroneously send traffic
>>>>> to a cache without a proper remap rule.
>>>>>
>>>>> We could fix this by
>>>>> a) advertising a state of the remap rule in astats to notify TR no
>>>>> longer to send traffic on that DS for a short period before the rule is
>>>>> actually removed (all handled inside of ORT).
>>>>> or
>>>>> b) prematurely removing the remap rule from astats, before the config
>>>>> on TS is actually updated (at the cost of missing the final few remap
>>>>> stats numbers). This is probably unacceptable.
>>>>>
>>>>> I’m sure there are other variants on this, but my main goal is for TR
>>>>> to directly learn from the caches which delivery services they actually
>>>>> have available, rather than the TR learning what TO only thinks each
>>>>> cache has available.
>>>>>
>>>>> —Eric
>>>>>
>>>>>> On Jan 31, 2017, at 8:10 AM, Nir Sopher <n...@qwilt.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In order to further improve the simplicity and robustness of the
>>>>>> control path for provisioning infrastructure and delivery services, we
>>>>>> are currently considering ways to streamline management and operations.
>>>>>>
>>>>>> Currently, when applying changes in traffic-control that require the
>>>>>> synchronization between the traffic-router and traffic-servers, the
>>>>>> user should be careful to do so in a certain order. Otherwise, "black
>>>>>> holes" may be created. Furthermore, in some of the scenarios the user
>>>>>> has to wait and verify that the configuration reached all traffic
>>>>>> servers before applying it to the traffic-router.
>>>>>>
>>>>>> We have noticed that TC-3.0 is planned to include a "Config State
>>>>>> Machine", probably dealing with the issue thoroughly. We have no
>>>>>> further information about this bullet and would appreciate any
>>>>>> additional info.
>>>>>>
>>>>>> We would like to start investing in making TC operations more
>>>>>> streamlined, robust and user-friendly.
>>>>>>
>>>>>> The main use-cases we would like to address at this point are:
>>>>>>
>>>>>> 1. Assign servers to a Delivery-Service.
>>>>>>    For this operation, the configuration must first be applied to the
>>>>>>    added traffic servers, propagate, and only then be applied to the
>>>>>>    traffic-router.
>>>>>> 2. Remove servers assignment to a Delivery-Service.
>>>>>>    For this operation, the configuration must first be applied to the
>>>>>>    traffic-router, and only then to the traffic-servers.
>>>>>> 3. Add a new delivery service.
>>>>>>    This is practically a special case of servers assignment to a
>>>>>>    delivery-service.
>>>>>> 4. Delete a delivery service.
>>>>>>    This is practically a special case of servers assignment removal
>>>>>>    from a delivery-service.
>>>>>> 5. Update settings that must be applied together on the traffic
>>>>>>    servers and the router.
>>>>>>
>>>>>> We would like to simplify the procedure, allowing the deployment of
>>>>>> new configuration in a single operation, instead of doing it step by
>>>>>> step.
>>>>>>
>>>>>> One solution can be based on the insight that deploying such
>>>>>> configuration changes may be done by initially updating the traffic
>>>>>> server with added functionality (e.g. remap-rule), then updating the
>>>>>> router, and lastly, removing old functionality from the traffic
>>>>>> servers. Such a solution can be orchestrated by traffic-ops, probably
>>>>>> without complicating other components.
>>>>>>
>>>>>> Other solutions may provide more flexibility, but would probably
>>>>>> involve adding complexity to other components such as traffic-router.
>>>>>>
>>>>>> We would be glad to hear the community's thoughts on the matter, so we
>>>>>> can take this further.
>>>>>>
>>>>>> Thanks,
>>>>>> Nir
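
For reference, the ordering described in the original message above (add on the caches, update the router, then remove from the caches) can be sketched in a few lines of Python. This is a toy model under assumptions, not Traffic Ops code: `Cache`, `Router`, and `reassign` are hypothetical names, and the acknowledgement step is reduced to a simple check where a real system would poll the caches.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Hypothetical stand-in for one traffic-server's remap.config state."""
    name: str
    remap_rules: set = field(default_factory=set)


@dataclass
class Router:
    """Hypothetical stand-in for the router's delivery-service -> servers map."""
    assignments: dict = field(default_factory=dict)


def reassign(ds, new_servers, caches, router, log):
    """Move delivery service `ds` to `new_servers` without creating a black hole."""
    old_servers = set(router.assignments.get(ds, ()))
    new_servers = set(new_servers)

    # Step 1: add the remap rule on newly assigned caches first.
    for name in sorted(new_servers - old_servers):
        caches[name].remap_rules.add(ds)
        log.append(("add_rule", name))

    # Step 2: "wait" for acknowledgement. Only a check in this sketch; a real
    # orchestrator would poll until every cache reports the new config.
    missing = sorted(n for n in new_servers if ds not in caches[n].remap_rules)
    if missing:
        raise RuntimeError(f"caches not ready: {missing}")

    # Step 3: only now point the router at the new server set.
    router.assignments[ds] = new_servers
    log.append(("update_router", ds))

    # Step 4: remove the rule from caches that are no longer assigned.
    for name in sorted(old_servers - new_servers):
        caches[name].remap_rules.discard(ds)
        log.append(("remove_rule", name))
```

At every point in this sequence, each server the router may redirect to already holds the remap rule, which is the property all five use-cases listed above rely on.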