Re: [controller-dev] [release] [OpenDaylight TSC] Oxygen SR4 Status build 2018-12-06

2018-12-10 Thread Daniel Farrell
I don't think we need a re-spin, since it's just a unit test change.

I'm not super sure it made sense to merge this, since we're about to
disable Oxygen jobs, but it also shouldn't be a big problem.

Thanks,
Daniel

On Mon, Dec 10, 2018 at 5:12 PM Jamo Luhrsen  wrote:

> wait, we merged a new oxygen patch and are doing a re-spin? I thought we
> only did that for release blockers?
>
> that means we'll have to get everyone to sign off on the csit results
> again.
>
> JamO
>
> On 12/10/18 4:14 AM, Sam Hague wrote:
> > CSIT won't verify this; it is a change to UT code for an intermittent
> failure. If the build passes, I think we are fine.
> >
> > On Mon, Dec 10, 2018, 4:38 AM Anil Belur   wrote:
> >
> >
> > On Mon, Dec 10, 2018 at 5:11 PM Aswin Suryanarayanan <
> asury...@redhat.com > wrote:
> >
> >
> >
> > On Sat, Dec 8, 2018 at 9:00 PM Jamo Luhrsen  wrote:
> >
> >
> >
> > On 12/8/18 3:19 AM, Ariel Adam wrote:
> >  > Hi everyone.
> >  > So our current situation is somewhat challenging with
> releasing Oxygen SR4 build #509.
> >  >
> >  > On the one hand we have a number of pending and blocking
> jobs:
> >  >
> >  >   * *COE* (pending) - */Prem/*, please take a look
> >  >   * *Controller* (pending) - */Tom/*, please take a look
> >  >   * *Groupbasedpolicy* (pending) - */Michal/*, please
> take a look
> >  >   * *Netvirt* (blocking) - */Jamo/Sam/*, do we know why
> the failures happened?
> >  >   * *sfc* (blocking) - */Jamo/Brady/*, do we know why
> the failure happened?
> >
> > Make sure you check the notes column too. I try to keep
> that up to date with
> > the jobs I'm investigating. At this point, I think I've been
> able to sign off
> > on all that I was tracking. There were tons of infra and 3rd
> party software
> > troubles recently that caused a lot of aborts and failures.
> We've been having
> > to re-run and fix things as we found them.
> >
> >  >
> >  > On the other hand builds #511 and #512 failed each due to
> different reasons:
> >  >
> >  >   * Build #511 due to a possible environment issue
> (*Anil/Thanh* to comment)
> >
> > yeah, looks like it to me too:
> >
> > (https://nexus.opendaylight.org/content/repositories/public/): GET request of:
> > org/hibernate/hibernate-core/5.2.6.Final/hibernate-core-5.2.6.Final.jar
> > from opendaylight-mirror failed: Premature end
> > of Content-Length delimited message body (expected: 6550533;
> > received: 192424) -> [Help 1]
> >
> >
> >  >   * Build #512 due to a problem in Netvirt (*Jamo/Sam* to
> comment)
> >
> > sorry, not sure on this one. Adding Aswin
> >
> > Adding Kiran,
> >
> > As Kiran pointed out in a different thread, there is an issue in the
> aclservice tests which should be fixed by [0], but
> > it is not yet merged.
> >
> > [0] : https://git.opendaylight.org/gerrit/#/c/78412/
> >
> >
> >
> > The change is merged; however, I don't see it marked as a blocker
> on the tracking sheet.
> > Do we need to re-trigger the autorelease, or can we just run the CSIT job to
> confirm the fix? Please advise.
> >
> > Thanks,
> > Anil
> >
> >
> ___
> release mailing list
> rele...@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/release
>
___
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev


Re: [controller-dev] [release] Autorelease oxygen failed to build sal-cluster-admin-impl from controller

2018-07-20 Thread Daniel Farrell
On Fri, Jul 20, 2018 at 10:36 AM Thanh Ha 
wrote:

> On Fri, Jul 20, 2018 at 10:01 AM Tom Pantelis 
> wrote:
>
>> On Fri, Jul 20, 2018 at 4:48 AM, Anil Belur 
>> wrote:
>>
>>> On Fri, Jul 20, 2018 at 11:12 AM Jenkins <
>>> jenkins-dontre...@opendaylight.org> wrote:
>>>
 Attention controller-devs,

 Autorelease oxygen failed to build sal-cluster-admin-impl from
 controller in build
 359. Attached is a snippet of the error message related to the
 failure that we were able to automatically parse as well as console
 logs.

 Console Logs:

 https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/autorelease-release-oxygen/359

 Jenkins Build:

 https://jenkins.opendaylight.org/releng/job/autorelease-release-oxygen/359/

 Please review and provide an ETA on when a fix will be available.

 Thanks,
 ODL releng/autorelease team

>>>  Hello controller-dev:
>>>
>>> Please look into these failed tests.
>>>
>>> Failed tests:
>>>
>>> ClusterAdminRpcServiceTest.testFlipMemberVotingStates:976->lambda$testFlipMemberVotingStates$8:978
>>> Expected leader member-1. Actual:
>>> member-1-shard-cars-oper_testFlipMemberVotingStates
>>>
>>> Tests run: 17, Failures: 1, Errors: 0, Skipped: 0
>>>
>>
>>
>> I ran it successfully 500 times locally. But looking at the code and the
>> test output from jenkins, I can see why it failed - just the right
>> timing sequence coupled with just enough of a random thread execution delay
>> and a deadline timeout set by the test being just a tad too low for that
>> delay. I'll push a patch. Another case where it seems there's occasionally
>> just enough of a slight delay or slowdown in the Jenkins environment to
>> throw off timing and cause a test failure.
>>
>
> Hi Tom,
>
> I'm curious: when you said you ran it successfully 500 times locally, did
> you perform a full build each time, or test the single test case in
> isolation?
>
> I found, while troubleshooting the bgpcep issue in the bgp-bmp-mock
> thread [0], that I had to run a full bgpcep build in order to reproduce the
> issue on my own laptop. I have a script that I'm testing now and
> making more generic, which I will share to this list later. It will
> allow us to run builds continuously, whether autorelease or project
> specific, over and over indefinitely, and capture the Maven output + surefire
> logs, which I hope will help folks reproduce intermittent issues
> locally.
>
> I feel like blaming the infrastructure for being "slow" is too easy an
> excuse for these issues. If the software were run in a customer's production
> environment, I suspect that telling the customer their hardware is too slow,
> or not the same hardware as the developer's laptop, would not be a
> solution the customer would be happy with.
>

+1000

Our code and tests need to be robust enough to handle diverse
infrastructure. Bugs like this might be highlighted by infra variability,
but they are still bugs in code/tests.
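
To make that concrete, here is a minimal sketch (hypothetical names only,
not the actual ClusterAdminRpcServiceTest patch) of the usual fix pattern
for the failure Tom describes: poll for the condition up to a generous
upper bound, rather than asserting after a fixed deadline that a
briefly-stalled thread can miss.

import static org.junit.Assert.fail;

public final class AwaitUtil {
    public interface Check {
        boolean isSatisfied();
    }

    // Poll every 100 ms until the condition holds or the generous upper
    // bound is exhausted, so a momentary scheduling delay on a slow
    // builder does not flip the test result.
    public static void awaitCondition(Check check, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (check.isSatisfied()) {
                return;
            }
            Thread.sleep(100);
        }
        fail("Condition not met within " + timeoutMillis + " ms");
    }

    private AwaitUtil() {
    }
}

A test could then pass a lambda for the leader check to awaitCondition
instead of making a single timed assertion.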

Not picking on TomP or Controller here; this is a general ODL culture
problem of blaming the infra first, until Thanh/Jamo/et al. prove
otherwise.


> I'm not sure what we can do to give more confidence in the
> infrastructure so that it's not the first thing that gets blamed every time
> there's a build issue, but we do run on build flavors in vexxhost that
> provide dedicated CPUs and RAM to our builders. Once I have some more
> validation on the infinite build script, maybe I can run it for a while on
> every autorelease-managed project and report to the projects with the
> script output on my 2 laptops + a few vexxhost instances.
>

Thanks for this work Thanh, seems like it will be very helpful for ironing
out intermittent failures.
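
For reference, a loop-runner of the sort Thanh describes might look
roughly like the sketch below (his actual script is not shown here, and is
presumably shell-based; this is only an illustration):

import java.io.File;

public class BuildLooper {
    public static void main(String[] args) throws Exception {
        for (int i = 1; ; i++) {
            File log = new File("build-" + i + ".log");
            // Run a full build and capture stdout+stderr per iteration so
            // the log of an intermittent failure is preserved.
            Process p = new ProcessBuilder("mvn", "clean", "install")
                    .redirectErrorStream(true)
                    .redirectOutput(log)
                    .start();
            if (p.waitFor() != 0) {
                System.out.println("Iteration " + i + " failed; see " + log);
                break;
            }
            System.out.println("Iteration " + i + " passed");
        }
    }
}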

Daniel


>
> Regards,
> Thanh
>
> [0] https://lists.opendaylight.org/pipermail/release/2018-July/015594.html
>
> ___
> release mailing list
> rele...@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/release
>
___
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev


Re: [controller-dev] [release] Autorelease oxygen failed to build sal-clustering-commons from controller

2018-03-16 Thread Daniel Farrell
Some test related to persistence timed out after three seconds?

Anything we can do about this? Even if it's "random"/"intermittent", I think
we need to start taking these types of failures very seriously, as we seem
to have so many "sporadic" failures that we're extremely likely to hit one of
them in any given build.
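
One hedged idea, sketched below rather than a tested fix for
sal-clustering-commons: let CI scale hard-coded test timeouts via a system
property (the name test.timeout.factor here is illustrative), so a builder
under load can run with, say, -Dtest.timeout.factor=3 instead of a bare
three-second deadline.

import java.time.Duration;

public final class TestTimeouts {
    // Read once; CI jobs can pass -Dtest.timeout.factor=3 on slow builders.
    private static final double FACTOR =
            Double.parseDouble(System.getProperty("test.timeout.factor", "1.0"));

    private TestTimeouts() {
    }

    // Scale a base timeout, e.g. scaled(Duration.ofSeconds(3)).
    public static Duration scaled(Duration base) {
        return Duration.ofMillis((long) (base.toMillis() * FACTOR));
    }
}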

On Fri, Mar 16, 2018 at 10:46 AM Jenkins 
wrote:

> Attention controller-devs,
>
> Autorelease oxygen failed to build sal-clustering-commons from controller
> in build
> 221. Attached is a snippet of the error message related to the
> failure that we were able to automatically parse as well as console logs.
>
>
> Console Logs:
>
> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/autorelease-release-oxygen/221
>
> Jenkins Build:
> https://jenkins.opendaylight.org/releng/job/autorelease-release-oxygen/221/
>
> Please review and provide an ETA on when a fix will be available.
>
> Thanks,
> ODL releng/autorelease team
>
> ___
> release mailing list
> rele...@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/release
>
___
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev


Re: [controller-dev] [release] Autorelease oxygen failed to build sal-binding-it from controller

2018-03-16 Thread Daniel Farrell
Is this the relevant error?

[ERROR]
test(org.opendaylight.controller.test.sal.binding.it.DataServiceIT)  Time
elapsed: 195.804 s  <<< ERROR!
org.ops4j.pax.swissbox.tracker.ServiceLookupException: gave up waiting for
service org.ops4j.pax.exam.ProbeInvoker

Anything we can do about it?
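
One knob that might be worth trying, assuming the stock pax-exam options
apply to this container setup (a sketch, not a verified fix, and the class
name is hypothetical): raise pax-exam's service lookup timeout so a slow
builder gets more time to register the probe before the lookup gives up.

import org.ops4j.pax.exam.Configuration;
import org.ops4j.pax.exam.Option;

import static org.ops4j.pax.exam.CoreOptions.options;
import static org.ops4j.pax.exam.CoreOptions.systemProperty;

public class SlowBuilderConfig {
    @Configuration
    public Option[] config() {
        return options(
                // Wait up to 5 minutes for services such as ProbeInvoker
                // instead of the default lookup timeout.
                systemProperty("pax.exam.service.timeout").value("300000"));
    }
}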

Thanks,
Daniel

On Thu, Mar 15, 2018 at 9:45 PM Jenkins 
wrote:

> Attention controller-devs,
>
> Autorelease oxygen failed to build sal-binding-it from controller in build
> 219. Attached is a snippet of the error message related to the
> failure that we were able to automatically parse as well as console logs.
>
>
> Console Logs:
>
> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/autorelease-release-oxygen/219
>
> Jenkins Build:
> https://jenkins.opendaylight.org/releng/job/autorelease-release-oxygen/219/
>
> Please review and provide an ETA on when a fix will be available.
>
> Thanks,
> ODL releng/autorelease team
>
> ___
> release mailing list
> rele...@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/release
>
___
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev


Re: [controller-dev] Best way to gracefully shutdown Karaf in ODL context

2017-10-13 Thread Daniel Farrell
Hey Muthu,

Yes, I think you should take a look at the systemd configuration we ship in
ODL's packages. As far as I know it does a good job of
starting/stopping/restarting ODL's service.

https://git.opendaylight.org/gerrit/gitweb?p=integration/packaging.git;a=blob;f=packages/rpm/unitfiles/opendaylight.service;h=ac436592d2880047986b856c7dd6810665ba0d3e;hb=refs/heads/master

Here's a Nitrogen RPM that contains that systemd config:

http://cbs.centos.org/repos/nfv7-opendaylight-70-release/x86_64/os/Packages/opendaylight-7.0.0-1.el7.noarch.rpm

This test job shows examples of `sudo systemctl [start, stop, status]`
working:

https://jenkins.opendaylight.org/releng/job/packaging-test-rpm-master

The logic for that job is here:

https://git.opendaylight.org/gerrit/gitweb?p=releng/builder.git;a=blob;f=jjb/packaging/packaging.yaml;h=e4de235ca543506063b7fb57c3d257f0b983abe3;hb=refs/heads/master#l346

That systemd config is also exercised in tests for puppet-opendaylight,
ansible-opendaylight, OPNFV Apex and other OPNFV installers.

It seems like you've put some good thought into this, so if you have any
suggestions for things we can do better please let us know. :)
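
For the two-step approach Muthu outlines below (stop, then verify the PID
is really gone), a minimal sketch might look like this. It assumes Java 9+
for ProcessHandle, and the PID-file location and 5-minute bound are
illustrative assumptions, not Karaf defaults.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class GracefulStop {
    public static void main(String[] args) throws Exception {
        Path karafHome = Paths.get(args[0]);

        // Assumption: the instance PID was recorded to this file at start
        // time; adjust to wherever your deployment tracks it.
        long pid = Long.parseLong(
                new String(Files.readAllBytes(karafHome.resolve("karaf.pid"))).trim());

        // Step 1: graceful stop, letting the framework run bundle
        // lifecycle listeners and drain in-flight work.
        new ProcessBuilder(karafHome.resolve("bin/stop").toString())
                .inheritIO().start().waitFor();

        // Step 2: wait for the old PID to actually exit; per Muthu's
        // observation this can take a few minutes in an ODL context.
        ProcessHandle.of(pid).map(ProcessHandle::onExit).ifPresent(future -> {
            try {
                future.get(5, TimeUnit.MINUTES);
            } catch (Exception e) {
                throw new IllegalStateException("Karaf did not exit cleanly", e);
            }
        });
        System.out.println("Karaf PID " + pid + " terminated.");
    }
}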

Daniel

On Thu, Oct 12, 2017 at 11:47 AM Jamo Luhrsen  wrote:

> +Daniel and Integration-dev,
>
> Daniel,
>
> does our rpm package and the systemd work you did for it answer any of
> Muthu's
> questions below? I'm assuming it *IS* the answer, but you will know better.
>
> Thanks,
> JamO
>
> On 10/12/2017 04:56 AM, Muthukumaran K wrote:
> > Hi,
> >
> >
> > *Context*: Figuring out the best possible way to gracefully shut down the
> Karaf process using standard Karaf commands.
> >
> > This is required because the framework-level shutdown sequence in
> Karaf gives the framework the opportunity to properly
> > execute bundle lifecycle listeners. What I mean is: an abrupt kill can
> potentially prevent lifecycle listeners from being
> > properly executed, and may also impact any in-flight transactions which
> may be in various stages of the replication and/or commit
> > phases. This can in turn lead to trouble during the recovery / restart
> phase.
> >
> >
> >
> > So, I thought of a middle ground where:
> >
> > 1)  We execute "karaf stop", followed by
> >
> > 2)  A periodic check that the last PID indeed terminates
> >
> >
> >
> > Doing a straight kill -9 could lead to rare heisenbugs wherein
> recovery could suffer, since there may not be room for
> > lifecycle listeners to execute (unless Karaf handles it as a unified
> shutdown hook and executes the same path as stop or any
> > graceful shutdown method)
> >
> >
> >
> > Has anybody tried any better methods without side effects?
> >
> >
> >
> >
> >
> > *An option was tried and the observation is as follows:*
> >
> > I used "karaf stop" followed by the "karaf status" command to check whether the
> process has come to a graceful termination. But it appears
> > that though the ‘status’ command reports the Karaf instance as ‘Not Running’,
> the PID still lingers for roughly 2 to 3 minutes in the ODL
> > context. I am inclined to think that there are indeed some lifecycle
> listeners executing … During this ‘PID lingering’ phase,
> > the thread dump hints that "System Bundle Shutdown" is waiting for the blueprint
> container to shut down the components (probably
> > executing the lifecycle listeners at the application and platform levels).
> >
> >
> >
> > "System Bundle Shutdown" #1582 daemon prio=5 os_prio=0
> tid=0x7fb05003d800 nid=0xe68 waiting on condition [0x7faf77678000]
> >
> >java.lang.Thread.State: TIMED_WAITING (parking)
> >
> > at sun.misc.Unsafe.park(Native Method)
> >
> > - parking to wait for  <0xe9064250> (a
> com.google.common.util.concurrent.AbstractFuture$Sync)
> >
> > at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> >
> > at
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> >
> > at
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> >
> > at
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:268)
> >
> > at
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
> >
> > at org.opendaylight.openflowplugin.openflow.md
> .core.MDController.stop(MDController.java:358)
> >
> > at
> > org.opendaylight.openflowplugin.openflow.md
> .core.sal.OpenflowPluginProvider.close(OpenflowPluginProvider.java:121)
> >
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> >
> > at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >
> > at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >
> > at java.lang.reflect.Method.invoke(Method.ja