On Thu, Jan 11, 2018 at 9:40 PM, Muthukumaran K <[email protected]> wrote:
> Hi Sam, Robert,
>
> On the observations which were made as early as September 2017 -
> https://lists.opendaylight.org/pipermail/netvirt-dev/2017-September/005518.html
> (thanks to Jamo for testing this out)
> Enabling tell based protocol had 22% failure of CSIT at releng level. More
> details on the last sandbox and releng runs below
>
> Having said that, since this is a 3 month old result and multiple changes
> would have gone into netvirt + genius itself, it would be prudent to
> test the same with the latest Oxygen build (at least it would reduce the
> possibility of misinterpreting netvirt + genius related issues as MD-SAL
> related issues). We will do one more sandbox run here at Ericsson with
> latest ODL Master and re-publish the results with and without tell-based
> protocol enabled by mid of next week. We will also try to run one round of
> bulk-flow provisioning with OFPlugin's bulk-o-matic test driver to see the
> scale behavior of tell-based protocol too.
>
> Actually two runs were performed one on releng and another in sandbox
> between last week of August and mid of September 2017 against Nitrogen :
>
> Releng run :
> ==========
> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-nitrogen/7/log.html.gz
>
> Sandbox run :
> ===========
> https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-3node-openstack-ocata-jamo-upstream-stateful-nitrogen/1/odl1_karaf.log.gz
>
> Jamo's observations from sandbox run :
> results are not good. Looks like things pass from a black box perspective
> in our first l2 connectivity suite, but then lots of failures after that.
>
> I also notice that our non-failing keyword to write to the karaf log using
> ssh to the karaf shell is failing, even in the above passing suite.
>
> Also, it's worth noting that in order to enable tell-based protocol I'm
> just stealing a controller robot suite to do the work and running it first.
> It makes the config change and reboots all the controllers.
>
> In one karaf log (I only looked at one) I saw a bunch of WARN messages
> about "Unknown history .... ignoring..."
> example:
>
> FrontendClientMetadataBuilder | 215 -
> org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT |
> member-1-shard-topology-operational: Unknown history for aborted transaction
> member-1-datastore-operational-fe-4-txn-7810-1, ignoring
>
> I also saw an ERROR about failure to serialize something or other:
>
> 2017-08-29 04:25:12,719 | ERROR | -dispatcher-3279 | EndpointWriter
> | 41 - com.typesafe.akka.slf4j - 2.4.18
> | Failed to serialize remote message [class akka.actor.Status$Failure]
> | using serializer [class akka.serialization.JavaSerializer].
> Transient association error (association remains live)
> akka.remote.MessageSerializer$SerializationException: Failed to serialize
> remote message [class akka.actor.Status$Failure] using serializer [class
> akka.serialization.JavaSerializer].
>
> Observations:
> ===========
>
> -----Original Message-----
> From: Robert Varga [mailto:[email protected]]
> Sent: Friday, January 12, 2018 2:11 AM
> To: Sam Hague
> Cc: Michael Vorburger; Muthukumaran K; Tom Pantelis; controller-dev;
> [email protected]; Kency Kurian
> Subject: Re: [controller-dev] Should application code persist do retries
> on TransactionCommitFailedException caused by AskTimeoutException or
> could CDS be configured to retry more?
>
> Regards
> Muthu
>
>
> On 11/01/18 21:26, Sam Hague wrote:
> > Robert,
> >
> > when you mention odlparent/yangtools integrated - what does that mean?
>
> I meant the yangtools-2.0.0 stuff needs to be merged up -- which obviously
> was delayed way longer than anticipated.
>
> > do we think that will happen for oxygen?
>
> I would love to have it in, but it does have potential to cause breakage
> -- hence I am afraid we are out of runway.
>
> > There are a number of clustering bugs open that all have
> > AskTimeoutException listed in the traces. I think the idea is the tell
> > based change will help and then we can dig deeper if the bugs still
> > exist.
>
> Yup.
>
> > Muthu,
> >
> > how did your testing with tell for netvirt tests go? Were we safe
> > switching to it?
>
> *This* is the most critical question that needs to be answered. If netvirt
> and BGP greenlight it, I think we can make the switch ...
>

+bgpcep-dev

For BGP the last status seems to be the below:

1. BGPCEP-392: BGP scaling target not met in 3-node cluster [0]
2. CONTROLLER-1645: shard moved during 1M bgp prefix advertizing (with
   tell-based=true) [1] -- see recent comments, this is occasionally seen
   with 300k prefixes as well

The CSIT tests corresponding to the above issues are currently red on oxygen [2] [3]

3. CONTROLLER-1715: 6 GB heap is not entirely enough for BGP ingest test
   with 1 million prefixes when tell-based protocol is used [4]

[0] https://jira.opendaylight.org/browse/BGPCEP-392
[1] https://jira.opendaylight.org/browse/CONTROLLER-1645
[2] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-bgpclustering-longevity-only-oxygen/
[3] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-periodic-bgpclustering-all-oxygen/
[4] https://jira.opendaylight.org/browse/CONTROLLER-1715

Regards
Ajay

> Regards,
> Robert
>
> _______________________________________________
> controller-dev mailing list
> [email protected]
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>
