Re: [controller-dev] Should application code persist do retries on TransactionCommitFailedException caused by AskTimeoutException or could CDS be configured to retry more?

Muthukumaran K Wed, 17 Jan 2018 22:25:29 -0800

Quick summary of the functional CSIT runs for the Netvirt cases, with the 
tell-based protocol both ON and OFF.

We are triggering one more set of tests with and without the tell-based 
protocol enabled, because some failures in the first set indicate that the VMs 
themselves were not reachable via SSH. This gives the tell-based protocol the 
benefit of the doubt and lets us check whether there are orthogonal issues in 
the test environment. We will publish those results too.

Anyway, Ajay’s observations with BGP-PCEP indicate that it would not be 
prudent to enable the tell-based protocol as of now; it requires deeper 
assessment.

Tell Based proto = ON
Results : 
https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-3node-openstack-ocata-gate-stateful-oxygen/1/
Pass Rate : 68.8 %

Tell Based proto = OFF
Results : 
https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-3node-openstack-ocata-gate-stateful-oxygen/2/
Pass Rate : 84.9 %

Quick observations from the karaf logs of the first set of runs:

1)      Node unreachability followed by association failure is equally 
frequent in both runs

2)      But in the tell-based scenario, association failure triggers 
re-election more frequently (not sure why the two modes differ here, since an 
association failure potentially triggering an election should be orthogonal to 
whether tell or ask is enabled)

3)      As a consequence of (2), determining the routee shard times out very 
frequently with the tell-based protocol

Will come back with observations on the second set of runs soon

Regards
Muthu

From: Ajay Lele [mailto:[email protected]]
Sent: Friday, January 12, 2018 12:19 PM
To: Muthukumaran K
Cc: Robert Varga; Sam Hague; controller-dev; [email protected]; 
Kency Kurian; Bgpcep-Dev; Luis Gomez Palacios
Subject: Re: [controller-dev] Should application code persist do retries on 
TransactionCommitFailedException caused by AskTimeoutException or could CDS be 
configured to retry more?

On Thu, Jan 11, 2018 at 9:40 PM, Muthukumaran K <[email protected]> wrote:
Hi Sam, Robert,

Regarding the observations made as early as September 2017 (thanks to Jamo 
for testing this out):
https://lists.opendaylight.org/pipermail/netvirt-dev/2017-September/005518.html 
Enabling the tell-based protocol resulted in a 22% CSIT failure rate at the 
releng level. More details on the last sandbox and releng runs are below.

Having said that, since this is a 3-month-old result and multiple changes would 
have gone into netvirt + genius itself, it would be prudent to test the same 
with the latest Oxygen build (at least it would reduce the possibility of 
misinterpreting netvirt + genius related issues as MD-SAL related issues). We 
will do one more sandbox run here at Ericsson with the latest ODL master and 
re-publish the results with and without the tell-based protocol enabled by the 
middle of next week. We will also try to run one round of bulk-flow 
provisioning with OFPlugin's bulk-o-matic test driver to see the scale behavior 
of the tell-based protocol too.

Two runs were actually performed against Nitrogen, one on releng and another 
in the sandbox, between the last week of August and mid-September 2017:

Releng run :
==========
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-nitrogen/7/log.html.gz

Sandbox run :
===========
https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-3node-openstack-ocata-jamo-upstream-stateful-nitrogen/1/odl1_karaf.log.gz

Jamo's observations from sandbox run :
results are not good. Looks like things pass from a black box perspective in 
our first l2 connectivity suite, but then lots of failures after that.

I also notice that our non-failing keyword to write to the karaf log using ssh 
to the karaf shell is failing, even in the above passing suite.

Also, it's worth noting that in order to enable tell-based protocol I'm just 
stealing a controller robot suite to do the work and running it first.
It makes the config change and reboots all the controllers.

In one karaf log (I only looked at one) I saw a bunch of WARN messages about 
"Unknown history .... ignoring..."
example:

  FrontendClientMetadataBuilder | 215 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | member-1-shard-topology-operational: Unknown history for aborted transaction member-1-datastore-operational-fe-4-txn-7810-1, ignoring

I also saw an ERROR about failure to serialize something or other:

2017-08-29 04:25:12,719 | ERROR | -dispatcher-3279 | EndpointWriter | 41 - com.typesafe.akka.slf4j - 2.4.18 | Failed to serialize remote message [class akka.actor.Status$Failure] using serializer [class akka.serialization.JavaSerializer]. Transient association error (association remains live)
akka.remote.MessageSerializer$SerializationException: Failed to serialize remote message [class akka.actor.Status$Failure] using serializer [class akka.serialization.JavaSerializer].

Observations:
===========

-----Original Message-----
From: Robert Varga [mailto:[email protected]]
Sent: Friday, January 12, 2018 2:11 AM
To: Sam Hague
Cc: Michael Vorburger; Muthukumaran K; Tom Pantelis; controller-dev; 
[email protected]; Kency Kurian
Subject: Re: [controller-dev] Should application code persist do retries on 
TransactionCommitFailedException caused by AskTimeoutException or could CDS be 
configured to retry more?

Regards
Muthu

On 11/01/18 21:26, Sam Hague wrote:
> Robert,
>
> when you mention odlparent/yangtools integrated - what does that mean?

I meant the yangtools-2.0.0 stuff needs to be merged up -- which obviously was 
delayed way longer than anticipated.

> do we think that will happen for oxygen?

I would love to have it in, but it does have potential to cause breakage
-- hence I am afraid we are out of runway.

> There are a number of clustering bugs open that all have
> AskTimeoutException listed in the traces. I think the idea is the tell
> based change will help and then we can dig deeper if the bugs still exist.

Yup.
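
For reference on the question in the subject line, application-level retry 
could look roughly like the sketch below. It is written in Python purely for 
illustration. The two exception classes are hypothetical stand-ins for the Java 
TransactionCommitFailedException and AskTimeoutException, and 
commit_with_retry, max_attempts and base_delay are made-up names, not CDS or 
MD-SAL API. It assumes the caller builds a fresh transaction on every attempt 
and that only ask-timeout failures are worth retrying.

```python
import time


class AskTimeoutException(Exception):
    """Hypothetical stand-in for Akka's AskTimeoutException."""


class TransactionCommitFailedException(Exception):
    """Hypothetical stand-in for the MD-SAL commit failure."""


def caused_by_ask_timeout(exc):
    """Walk the __cause__ chain looking for an AskTimeoutException."""
    while exc is not None:
        if isinstance(exc, AskTimeoutException):
            return True
        exc = exc.__cause__
    return False


def commit_with_retry(commit, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call commit(); retry only when the failure is caused by an ask timeout.

    Assumption: commit() creates and submits a new transaction each time it
    is called, rather than resubmitting a failed one.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return commit()
        except TransactionCommitFailedException as exc:
            # Give up on the last attempt or on any non-timeout failure.
            if attempt == max_attempts or not caused_by_ask_timeout(exc):
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Failures with any other cause are re-raised immediately, so a retry loop like 
this would not mask genuine commit errors; whether retrying in the application 
is preferable to tuning CDS is exactly the open question of this thread.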

> Muthu,
>
> how did your testing with tell for netvirt tests go? Were we safe
> switching to it?

*This* is the most critical question that needs to be answered. If netvirt and 
BGP greenlight it, I think we can make the switch ...

+bgpcep-dev

For BGP the last status seems to be the below:

1. BGPCEP-392: BGP scaling target not met in 3-node cluster [0]

2. CONTROLLER-1645: shard moved during 1M bgp prefix advertizing (with 
tell-based=true) [1] -- see recent comments, this is occasionally seen with 
300k prefixes as well

CSIT test corresponding to above issues is currently red on oxygen [2] [3]

3. CONTROLLER-1715: 6 GB heap is not entirely enough for BGP ingest test with 1 
million prefixes when tell-based protocol is used [4]

[0] https://jira.opendaylight.org/browse/BGPCEP-392
[1] https://jira.opendaylight.org/browse/CONTROLLER-1645
[2] 
https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-bgpclustering-longevity-only-oxygen/
[3] 
https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-periodic-bgpclustering-all-oxygen/
[4] https://jira.opendaylight.org/browse/CONTROLLER-1715

Regards
Ajay

Regards,
Robert
_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
