Re: [controller-dev] Should application code persist do retries on TransactionCommitFailedException caused by AskTimeoutException or could CDS be configured to retry more?

Ajay Lele Thu, 11 Jan 2018 22:49:44 -0800

On Thu, Jan 11, 2018 at 9:40 PM, Muthukumaran K <[email protected]> wrote:


> Hi Sam, Robert,
>
> On the observations made as early as September 2017 (thanks to Jamo for
> testing this out):
> https://lists.opendaylight.org/pipermail/netvirt-dev/2017-September/005518.html
> With the tell-based protocol enabled, 22% of CSIT failed at the releng
> level. More details on the last sandbox and releng runs are below.
>
> Having said that, since this is a 3-month-old result and multiple changes
> will have gone into netvirt + genius since then, it would be prudent to
> test the same with the latest Oxygen build (at least it would reduce the
> possibility of misinterpreting netvirt + genius related issues as MD-SAL
> related issues). We will do one more sandbox run here at Ericsson with the
> latest ODL master and re-publish the results, with and without the
> tell-based protocol enabled, by the middle of next week. We will also try
> to run one round of bulk-flow provisioning with OFPlugin's bulk-o-matic
> test driver to see the scale behavior of the tell-based protocol.
>
> Two runs were actually performed against Nitrogen, one on releng and
> another in sandbox, between the last week of August and mid-September 2017:
>
> Releng run :
> ==========
> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-nitrogen/7/log.html.gz
>
> Sandbox run :
> ===========
> https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-3node-openstack-ocata-jamo-upstream-stateful-nitrogen/1/odl1_karaf.log.gz
>
> Jamo's observations from sandbox run :
> results are not good. Looks like things pass from a black box perspective
> in our first l2 connectivity suite, but then lots of failures after that.
>
> I also notice that our non-failing keyword to write to the karaf log using
> ssh to the karaf shell is failing, even in the above passing suite.
>
> Also, it's worth noting that in order to enable tell-based protocol I'm
> just stealing a controller robot suite to do the work and running it first.
> It makes the config change and reboots all the controllers.
>
> In one karaf log (I only looked at one) I saw a bunch of WARN messages
> about "Unknown history .... ignoring..."
> example:
>
>   FrontendClientMetadataBuilder | 215 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | member-1-shard-topology-operational: Unknown history for aborted transaction member-1-datastore-operational-fe-4-txn-7810-1, ignoring
>
> I also saw an ERROR about failure to serialize something or other:
>
> 2017-08-29 04:25:12,719 | ERROR | -dispatcher-3279 | EndpointWriter | 41 - com.typesafe.akka.slf4j - 2.4.18 | Failed to serialize remote message [class akka.actor.Status$Failure] using serializer [class akka.serialization.JavaSerializer]. Transient association error (association remains live)
> akka.remote.MessageSerializer$SerializationException: Failed to serialize remote message [class akka.actor.Status$Failure] using serializer [class akka.serialization.JavaSerializer].
>
> Observations:
> ===========
>
> -----Original Message-----
> From: Robert Varga [mailto:[email protected]]
> Sent: Friday, January 12, 2018 2:11 AM
> To: Sam Hague
> Cc: Michael Vorburger; Muthukumaran K; Tom Pantelis; controller-dev;
> [email protected]; Kency Kurian
> Subject: Re: [controller-dev] Should application code persist do retries
> on TransactionCommitFailedException caused by AskTimeoutException or
> could CDS be configured to retry more?
>
> Regards
> Muthu
>
>
> On 11/01/18 21:26, Sam Hague wrote:
> > Robert,
> >
> > when you mention odlparent/yangtools integrated - what does that mean?
>
> I meant the yangtools-2.0.0 stuff needs to be merged up -- which obviously
> was delayed way longer than anticipated.
>
> > do we think that will happen for oxygen?
>
> I would love to have it in, but it does have potential to cause breakage
> -- hence I am afraid we are out of runway.
>
> > There are a number of clustering bugs open that all have
> > AskTimeoutException listed in the traces. I think the idea is the tell
> > based change will help and then we can dig deeper if the bugs still
> > exist.
>
> Yup.
>
> > Muthu,
> >
> > how did your testing with tell for netvirt tests go? Were we safe
> > switching to it?
>
> *This* is the most critical question that needs to be answered. If netvirt
> and BGP greenlight it, I think we can make the switch ...
>

+bgpcep-dev

For BGP the last status seems to be the below:

1. BGPCEP-392: BGP scaling target not met in 3-node cluster [0]

2. CONTROLLER-1645: shard moved during 1M bgp prefix advertizing (with
tell-based=true) [1] -- see recent comments, this is occasionally seen with
300k prefixes as well

The CSIT tests corresponding to the above issues are currently red on Oxygen [2] [3]

3. CONTROLLER-1715: 6 GB heap is not entirely enough for BGP ingest test
with 1 million prefixes when tell-based protocol is used [4]

[0] https://jira.opendaylight.org/browse/BGPCEP-392
[1] https://jira.opendaylight.org/browse/CONTROLLER-1645
[2] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-bgpclustering-longevity-only-oxygen/
[3] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-periodic-bgpclustering-all-oxygen/
[4] https://jira.opendaylight.org/browse/CONTROLLER-1715
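
On the application-side retry asked about in the subject line, a rough sketch
of what it could look like is below. This is only an illustration under stated
assumptions, not anything CDS or MD-SAL provides: CommitRetry, withRetry and
causedBy are hypothetical names, the nested AskTimeoutException is a local
stand-in for the Akka exception so the sketch compiles without ODL on the
classpath, and matching the cause chain by simple class name is just one
possible way to detect the timeout. The callable is assumed to build and
submit a fresh transaction on every call.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retry a commit only when an ask timeout is in the cause chain. */
public final class CommitRetry {
    private CommitRetry() {}

    /** Stand-in for the Akka exception, only so this sketch is self-contained. */
    public static final class AskTimeoutException extends RuntimeException {
        public AskTimeoutException(String message) {
            super(message);
        }
    }

    /** Returns true if any throwable in the cause chain has the given simple class name. */
    public static boolean causedBy(Throwable failure, String simpleName) {
        for (Throwable cause = failure; cause != null; cause = cause.getCause()) {
            if (cause.getClass().getSimpleName().equals(simpleName)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Calls commit up to maxAttempts times. Failures without an AskTimeoutException
     * in their cause chain are rethrown at once; timeouts are retried after a
     * linearly growing pause (backoffMillis * attempt).
     */
    public static <T> T withRetry(Callable<T> commit, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return commit.call();
            } catch (Exception e) {
                if (!causedBy(e, "AskTimeoutException")) {
                    throw e; // not a timeout, retrying is unlikely to help
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated commit: times out twice, then succeeds.
        String outcome = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("commit failed", new AskTimeoutException("ask timed out"));
            }
            return "committed";
        }, 5, 10L);
        System.out.println(outcome + " after " + calls[0] + " attempts");
    }
}
```

Two caveats with any retry of this kind: a commit that timed out may still
have been applied, so the retried operation should be safe to repeat, and the
retry does not address whatever caused the timeout in the first place.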

Regards
Ajay


> Regards,
> Robert
>
> _______________________________________________
> controller-dev mailing list
> [email protected]
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>

