Hi Sam, Robert, 

Observations on this were made as early as September 2017 (thanks to Jamo for 
testing this out):
https://lists.opendaylight.org/pipermail/netvirt-dev/2017-September/005518.html 
With the tell-based protocol enabled, CSIT had a 22% failure rate at the releng 
level. More details on the last sandbox and releng runs are below.

Having said that, since this result is 3 months old and multiple changes will 
have gone into netvirt + genius since then, it would be prudent to repeat the 
test with the latest Oxygen build (at the least, it would reduce the chance of 
misinterpreting netvirt + genius related issues as MD-SAL related issues). We 
will do one more sandbox run here at Ericsson with the latest ODL master and 
re-publish the results, with and without the tell-based protocol enabled, by 
the middle of next week. We will also try to run one round of bulk-flow 
provisioning with OFPlugin's bulk-o-matic test driver to see how the tell-based 
protocol behaves at scale. 

Two runs were performed against Nitrogen, one on releng and one in the sandbox, 
between the last week of August and mid-September 2017: 

Releng run :
========== 
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-nitrogen/7/log.html.gz

Sandbox run : 
===========
https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-3node-openstack-ocata-jamo-upstream-stateful-nitrogen/1/odl1_karaf.log.gz

Jamo's observations from sandbox run :
results are not good. Looks like things pass from a black box perspective in 
our first l2 connectivity suite, but then lots of failures after that.

I also notice that our non-failing keyword to write to the karaf log using ssh 
to the karaf shell is failing, even in the above passing suite.

Also, it's worth noting that in order to enable tell-based protocol I'm just 
stealing a controller robot suite to do the work and running it first.
It makes the config change and reboots all the controllers.
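
For readers who want to reproduce the config change by hand, here is a minimal 
sketch of toggling the setting in a Karaf-style .cfg file. It assumes the 
property is named use-tell-based-protocol and lives in 
etc/org.opendaylight.controller.cluster.datastore.cfg; neither is stated in 
this thread, so please verify both against your distribution. The helper name 
set_property is hypothetical and is not what the robot suite actually runs.

```python
import re


def set_property(cfg_text: str, key: str, value: str) -> str:
    """Return cfg_text with key=value set.

    An existing entry (active or commented out with '#') is replaced in
    place; otherwise the entry is appended at the end of the text.
    """
    # Match the whole line holding the key, optionally commented out.
    pattern = re.compile(
        r"^[ \t]*#?[ \t]*" + re.escape(key) + r"[ \t]*=.*$", re.MULTILINE
    )
    new_line = f"{key}={value}"
    if pattern.search(cfg_text):
        # A callable replacement avoids backslash interpretation in the value.
        return pattern.sub(lambda _m: new_line, cfg_text, count=1)
    sep = "" if cfg_text == "" or cfg_text.endswith("\n") else "\n"
    return cfg_text + sep + new_line + "\n"


# Assumed property name; verify it before using it on a real controller.
TELL_KEY = "use-tell-based-protocol"
```

The change would have to be applied on every cluster member, followed by a 
restart of each controller, as the suite described above does.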

In one karaf log (I only looked at one) I saw a bunch of WARN messages about 
"Unknown history .... ignoring..."
example:

  FrontendClientMetadataBuilder    | 215 - 
org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | 
member-1-shard-topology-operational: Unknown history for aborted transaction 
member-1-datastore-operational-fe-4-txn-7810-1, ignoring
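
Since only one karaf log was inspected, a small script could help check how 
widespread these warnings are across members. The sketch below counts 
"Unknown history ... ignoring" messages per shard. The regular expression is 
an assumption based on the single example above, not on the full set of 
message variants the datastore can emit, and count_unknown_history is a 
hypothetical helper name.

```python
import re
from collections import Counter

# Assumed message shape: "<shard-name>: Unknown history ... ignoring"
UNKNOWN_HISTORY = re.compile(r"(?P<shard>\S+): Unknown history\b.*\bignoring")


def count_unknown_history(lines):
    """Count 'Unknown history' warnings per shard in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UNKNOWN_HISTORY.search(line)
        if match:
            counts[match.group("shard")] += 1
    return counts
```

The function takes any iterable of lines, so a downloaded odl1_karaf.log.gz 
could be passed in through gzip.open(path, "rt") and the per-member counts 
compared.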

I also saw an ERROR about failure to serialize something or other:

2017-08-29 04:25:12,719 | ERROR | -dispatcher-3279 | EndpointWriter | 
41 - com.typesafe.akka.slf4j - 2.4.18 | Failed to serialize remote message 
[class akka.actor.Status$Failure] using serializer [class 
akka.serialization.JavaSerializer]. Transient association error (association 
remains live)
akka.remote.MessageSerializer$SerializationException: Failed to serialize 
remote message [class akka.actor.Status$Failure] using serializer [class 
akka.serialization.JavaSerializer].

Observations:
===========

-----Original Message-----
From: Robert Varga [mailto:[email protected]] 
Sent: Friday, January 12, 2018 2:11 AM
To: Sam Hague
Cc: Michael Vorburger; Muthukumaran K; Tom Pantelis; controller-dev; 
[email protected]; Kency Kurian
Subject: Re: [controller-dev] Should application code persist do retries on 
TransactionCommitFailedException caused by AskTimeoutException or could CDS be 
configured to retry more?

Regards
Muthu


On 11/01/18 21:26, Sam Hague wrote:
> Robert,
> 
> when you mention odlparent/yangtools integrated - what does that mean?

I meant the yangtools-2.0.0 stuff needs to be merged up -- which obviously was 
delayed way longer than anticipated.

> do we think that will happen for oxygen?

I would love to have it in, but it does have potential to cause breakage
-- hence I am afraid we are out of runway.

> There are a number of clustering bugs open that all have 
> AskTimeoutException listed in the traces. I think the idea is the tell 
> based change will help and then we can dig deeper if the bugs still exist.

Yup.

> Muthu,
> 
> how did your testing with tell for netvirt tests go? Were we safe 
> switching to it?

*This* is the most critical question that needs to be answered. If netvirt and 
BGP greenlight it, I think we can make the switch ...

Regards,
Robert

_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
