Re: [controller-dev] [integration-dev] [mdsal-dev] 3node cluster regression in Carbon - since Jan 5th

Sam Hague Wed, 11 Jan 2017 08:42:02 -0800

On Wed, Jan 11, 2017 at 6:48 AM, Robert Varga <n...@hq.sk> wrote:

> On 01/10/2017 11:12 PM, Jamo Luhrsen wrote:
> >> All we need to agree is that if that cloud suite is failing - all
> >> relevant project should stop merging (even as a process and not by a
> >> gerrit mechanic lock) until we are back from regression.
> > we aren't totally stable enough yet, imho. We are very close though.
> >
> > <devils advocate>
> > however convincing these dependent projects to stop merging is asking a
> > lot. Who says md-sal or controller gives a hoot about the "cloud" stuff
> > working for opendaylight. maybe other ODL projects are still working
> > fine and the assumption is our cloud projects are the projects that
> > need to fix themselves, while everyone else can continue to do work.
> > </devils advocate>
> >
>
> To expand it a bit -- the lower your offset number is, the more of these
> cloud-like-stuff you have in your downstream. Asking upstreams to drop
> everything and scramble to fix downstream issues -- whenever they
> surface and for whatever reason -- will mean that upstream will not get
> any meaningful work done.
>
> I will put it very bluntly: the root cause of the problem discussed is
> integration on snapshot versions. The proposal is to gate development on
> end-to-end tests. That leads to massive use of computing resources with
> the corresponding latency in development pipeline, as each patch needs
> to go through the full validation suite. To bring that point into
> perspective: would it be okay for offset-2 patches to be gated by OPNFV
> test suites?
>
We should separate out how we got here. netvirt has very complicated setups
that require end-to-end testing to validate patches. It is difficult to
write a unit test or integration test to validate the patches, and we don't
have a framework for that - we have tried and nothing is there. That pushed
us to use CSIT. This is the same flow OpenStack, OPNFV and other projects
use. It is a valid and reasonable use, and it solves the gating problem.


What we have also found is that offset 0 and 1 projects are leaking
problems to later offsets. Those issues are caught by the netvirt CSIT
because it is very comprehensive. This makes the netvirt CSIT a de facto
test suite for the lower offsets. We recognize it is difficult for the lower
offsets to have a comprehensive set of tests covering all use cases, and
this is why the netvirt CSIT ends up doing the verification.

In the absence of a comprehensive suite in the lower offsets we made the
suggestion of having a cloud test. Should it block the lower offsets?
Probably not. But the test could be used to highlight issues so that we get
community involvement early. If the test doesn't block lower-offset
development, then we need active participation from the lower offsets to
watch the job and be proactive on defect resolution. We spent two weeks
recently pointing fingers trying to find the root cause. This has happened
before, so it is a real problem, and it has been an issue in ODL since the
beginning. It is very difficult to troubleshoot issues outside your project
because there are so many dependencies.

>
> The proposed gating scenarios are okay for releases, not for individual
> patches. For that we need to move away from snapshots+autorelease to
> per-project release jobs. That work starts at leaf projects, which have
> to be able to cope with version bumps and version skew -- the first one
> being integration/distribution, which must not ensure it is pulling in
> exactly one version of each ODL artifact.
>
I can see this helping some. You still need a way to verify the use cases
to ensure that when a lower offset releases something it will work. When the
lower offset releases, we can easily look at the test results to know if
there was an issue. I think you still have the finger-pointing issue, though.
When you depend on 10 projects, you have to find the smoking gun to get
support. That job is usually dumped on the offset 2 project: go and look at
all patches in the lower offsets, rule them out one by one, highlight the
offending patch and then get resolution.

The other item discussed in this thread, the Jenkins resource issues, is
being resolved separately. We identified different areas to target:
- not all resources are being used right now. jclouds API problems, hopefully
to be resolved using heat templates
- auditing existing jobs. Some jobs aren't meaningful, some jobs do too
much, some jobs could run less often
- low priority jobs starving higher priority jobs. The priority is only
logical at the moment, so it needs to be reflected in the job
configuration

>
> Bye,
> Robert
>
>
> _______________________________________________
> controller-dev mailing list
> controller-dev@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>
>

