Laurent,

You are of course correct that simply running some ANIMA components would not 
avoid or necessarily repair such incidents. Life is not that easy. However, I 
believe that ANIMA (or other autonomic techniques) have the potential to make 
such incidents either rarer or more quickly repaired.

The ACP is designed to survive partitions and merges, and in the worst case to 
rebuild itself completely, in the absence of any traditionally configured 
control plane or data plane. That would allow ASAs to restart themselves if 
necessary, but in any case to continue their jobs even when normal routing is 
broken. I assume that an important class of ASAs will be those that watch and 
verify normal operations, and in both of the reported incidents, some kind of 
anomaly detection could have happened within a minute or two. That might not 
lead to immediate diagnosis of the problem, but maybe it could (for example) 
cause an automatic rollback of any recent configuration changes. Even an ASA 
without network access could do that: nothing is working, so roll back the 
recent ACL updates!
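
A minimal sketch of that last idea, in Python. Everything in it is hypothetical 
and for illustration only: the names RollbackWatchdog, ConfigChange and tick 
are invented here and are not part of any ANIMA specification, and the probes 
stand in for whatever purely local checks an ASA might be able to run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConfigChange:
    """A recent change plus a local way to undo it (hypothetical)."""
    description: str
    undo: Callable[[], None]


@dataclass
class RollbackWatchdog:
    """Rolls back recent changes when every probe keeps failing."""
    probes: List[Callable[[], bool]]   # each returns True if "something works"
    failure_threshold: int = 3         # assumed value, not from any spec
    recent_changes: List[ConfigChange] = field(default_factory=list)
    consecutive_failures: int = 0

    def record_change(self, change: ConfigChange) -> None:
        self.recent_changes.append(change)

    def tick(self) -> List[str]:
        """Run one round of probes; return descriptions of undone changes."""
        if any(probe() for probe in self.probes):
            # At least one check succeeded, so do nothing drastic.
            self.consecutive_failures = 0
            return []
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return []
        # Nothing is working: undo recent changes, newest first.
        rolled_back = []
        while self.recent_changes:
            change = self.recent_changes.pop()
            change.undo()
            rolled_back.append(change.description)
        self.consecutive_failures = 0
        return rolled_back
```

The point of the sketch is only that the decision needs no network access: 
the probes, the change log and the undo actions are all local to the node.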

A slightly more abstract point is that an autonomic network will in theory need 
fewer configuration updates by human operators, so such problems will be less 
likely.

Regards
   Brian

On 18-Dec-20 21:42, Ciavaglia, Laurent (Nokia - FR/Paris-Saclay) wrote:
> Hi Brian,
> 
> Thanks for sharing interesting incidents and reflecting on the role of ANIMA 
> technologies.
> 
> Reading the report, it seems the outage results from a series of indirectly 
> linked events leading to isolation of a portion of the GCP network.
> 
> For the incident you refer to here, how/where do you see ANIMA components 
> having helped to avoid the outage?
> Could we expect ANIMA networks to provide better/longer data plane operation 
> in case of failed control plane/control plane functions? Beyond the pure 
> ability to do so, there is a gain/risk trade-off in letting a DP run out of 
> sync with its CP.
> Could we expect ANIMA components to have reacted differently and circumvented 
> the issue, preventing the full disconnection of the GCP network portion? Or 
> mitigated it at intermediate points? 
> The incident seems to have been triggered by a legitimate/valid configuration 
> change, but one that resulted in a function losing access to files it needed 
> to operate. The config. change validation basically didn't notice that there 
> was a possible issue.
> Would such a miss have been caught by ANIMA components? (e.g. via a different 
> validation-dependencies approach?)
> 
> I'm a bit doubtful here because even if we deploy robust autonomic 
> functions/agents, the above problem seems to originate from an issue in how 
> the system has been configured.
> And even ASAs would still need to get some form/level of initial 
> configuration or guidance.
> 
> 
> Best regards, 
> Laurent
> 
>> -----Original Message-----
>> From: Anima <[email protected]> On Behalf Of Brian E Carpenter
>> Sent: Thursday, December 17, 2020 02:47
>> To: Anima WG <[email protected]>
>> Subject: Re: [Anima] ANIMA when there is a system-wide issue
>>
>> And here's what happens when the control plane itself falls over:
>>
>> https://status.cloud.google.com/incident/zall/20011#20011006
>>
>> It seems pretty clear that Cloud needs ANIMA.
>>
>> Regards
>>    Brian
>>
>> On 01-Dec-20 11:02, Brian E Carpenter wrote:
>>> "AWS reveals it broke itself by exceeding OS thread limits"
>>>
>>> https://www.theregister.com/2020/11/30/aws_outage_explanation/
>>>
>>> Especially:
>>> "The TIFU-like post also outlines why Amazon's dashboards offered only
>>> scanty info about the incident – because they, too, depend on a service
>>> that depends on Kinesis."
>>>
>>> Perhaps there is something we should specify in ANIMA to prevent the
>>> ANIMA infrastructure falling into this sort of trap: when there is a
>>> system-wide issue (such as hitting an O/S resource limit everywhere at the
>>> same time) it also prevents the autonomic mechanisms from working.
>>>
>>> Regards
>>>    Brian Carpenter
>>>
>>
>> _______________________________________________
>> Anima mailing list
>> [email protected]
>> https://www.ietf.org/mailman/listinfo/anima
