Quick summary of the functional CSIT runs for Netvirt cases with the tell-based protocol ON as well as OFF.
We are triggering one more set of tests, with and without the tell-based protocol enabled, because some failures in the first set indicate that the VMs themselves were not reachable via SSH. This is to give the benefit of the doubt to the tell-based protocol and to check whether there are orthogonal issues in the test environment. We will publish those results too.

Anyway, Ajay’s observations with BGP-PCEP indicate that it would not be prudent to enable the tell-based protocol as of now, and that it requires deeper assessment.

Tell-based protocol = ON
Results : https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-3node-openstack-ocata-gate-stateful-oxygen/1/
Pass Rate : 68.8 %

Tell-based protocol = OFF
Results : https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-3node-openstack-ocata-gate-stateful-oxygen/2/
Pass Rate : 84.9 %

Quick observations from the karaf logs on the first set of runs:

1) Node unreachability followed by association failure is equally frequent in both runs
2) But in the tell-based scenario, association failure triggers re-election more frequently (not sure why this is less rampant with ask-based, since an association failure potentially triggering an election is orthogonal to tell / ask enablement)
3) As a consequence of (2), determining the routee shard times out very frequently with the tell-based protocol

Will come back with observations on the second set of runs soon.

Regards
Muthu

From: Ajay Lele [mailto:[email protected]]
Sent: Friday, January 12, 2018 12:19 PM
To: Muthukumaran K
Cc: Robert Varga; Sam Hague; controller-dev; [email protected]; Kency Kurian; Bgpcep-Dev; Luis Gomez Palacios
Subject: Re: [controller-dev] Should application code persist do retries on TransactionCommitFailedException caused by AskTimeoutException or could CDS be configured to retry more?
On Thu, Jan 11, 2018 at 9:40 PM, Muthukumaran K <[email protected]> wrote:

Hi Sam, Robert,

On the observations which were made as early as September 2017 - https://lists.opendaylight.org/pipermail/netvirt-dev/2017-September/005518.html (thanks to Jamo for testing this out)

Enabling the tell-based protocol resulted in a 22% CSIT failure rate at the releng level. More details on the last sandbox and releng runs are below.

Having said that, since this is a 3-month-old result and multiple changes would have gone into netvirt + genius itself, it would be prudent to test the same with the latest Oxygen build (at least it would reduce the possibility of misinterpreting netvirt + genius related issues as MD-SAL related issues).

We will do one more sandbox run here at Ericsson with the latest ODL Master and re-publish the results, with and without the tell-based protocol enabled, by the middle of next week. We will also try to run one round of bulk-flow provisioning with OFPlugin's bulk-o-matic test driver to see the scale behavior of the tell-based protocol too.

Actually, two runs were performed, one on releng and another in sandbox, between the last week of August and mid-September 2017 against Nitrogen:

Releng run :
==========
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-nitrogen/7/log.html.gz

Sandbox run :
===========
https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-3node-openstack-ocata-jamo-upstream-stateful-nitrogen/1/odl1_karaf.log.gz

Jamo's observations from sandbox run :

results are not good. Looks like things pass from a black box perspective in our first l2 connectivity suite, but then lots of failures after that. I also notice that our non-failing keyword to write to the karaf log using ssh to the karaf shell is failing, even in the above passing suite.

Also, it's worth noting that in order to enable tell-based protocol I'm just stealing a controller robot suite to do the work and running it first.
It makes the config change and reboots all the controllers.

In one karaf log (I only looked at one) I saw a bunch of WARN messages about "Unknown history .... ignoring..."

example:

FrontendClientMetadataBuilder | 215 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | member 1-shard-topology-operational: Unknown history for aborted transaction member-1-datastore-operational-fe-4-txn-7810-1, ignoring

I also saw an ERROR about failure to serialize something or other:

2017-08-29 04:25:12,719 | ERROR | -dispatcher-3279 | EndpointWriter | 41 - com.typesafe.akka.slf4j - 2.4.18 | Failed to serialize remote message [class akka.actor.Status$Failure] | using serializer [class akka.serialization.JavaSerializer]. Transient association error (association remains live)
akka.remote.MessageSerializer$SerializationException: Failed to serialize remote message [class akka.actor.Status$Failure] using serializer [class akka.serialization.JavaSerializer].

Observations:
===========

Regards
Muthu

-----Original Message-----
From: Robert Varga [mailto:[email protected]]
Sent: Friday, January 12, 2018 2:11 AM
To: Sam Hague
Cc: Michael Vorburger; Muthukumaran K; Tom Pantelis; controller-dev; [email protected]; Kency Kurian
Subject: Re: [controller-dev] Should application code persist do retries on TransactionCommitFailedException caused by AskTimeoutException or could CDS be configured to retry more?

On 11/01/18 21:26, Sam Hague wrote:
> Robert,
>
> when you mention odlparent/yangtools integrated - what does that mean?

I meant the yangtools-2.0.0 stuff needs to be merged up -- which obviously was delayed way longer than anticipated.

> do we think that will happen for oxygen?

I would love to have it in, but it does have potential to cause breakage -- hence I am afraid we are out of runway.
> There are a number of clustering bugs open that all have
> AskTimeoutException listed in the traces. I think the idea is the tell
> based change will help and then we can dig deeper if the bugs still exist.

Yup.

> Muthu,
>
> how did your testing with tell for netvirt tests go? Were we safe
> switching to it?

*This* is the most critical question that needs to be answered. If netvirt and BGP greenlight it, I think we can make the switch ...

+bgpcep-dev

For BGP the last status seems to be the below:

1. BGPCEP-392: BGP scaling target not met in 3-node cluster [0]
2. CONTROLLER-1645: shard moved during 1M bgp prefix advertizing (with tell-based=true) [1] -- see recent comments, this is occasionally seen with 300k prefixes as well

CSIT test corresponding to above issues is currently red on oxygen [2] [3]

3. CONTROLLER-1715: 6 GB heap is not entirely enough for BGP ingest test with 1 million prefixes when tell-based protocol is used [4]

[0] https://jira.opendaylight.org/browse/BGPCEP-392
[1] https://jira.opendaylight.org/browse/CONTROLLER-1645
[2] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-bgpclustering-longevity-only-oxygen/
[3] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-periodic-bgpclustering-all-oxygen/
[4] https://jira.opendaylight.org/browse/CONTROLLER-1715

Regards
Ajay

Regards,
Robert

_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
