On Thu, Jun 9, 2016 at 3:01 PM, Michael Scherer <msche...@redhat.com> wrote:
> On Thursday, 9 June 2016 at 14:32 +0530, Kaushal M wrote:
>> On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M <kshlms...@gmail.com> wrote:
>> > On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam
>> > <sarum...@redhat.com> wrote:
>> >> Hi Kaushal,
>> >>
>> >> One of the patches (http://review.gluster.org/#/c/14653/) is failing
>> >> in the NetBSD regression.
>> >> Its log:
>> >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
>> >>
>> >> But the patch mentioned in the NetBSD job is another one
>> >> (http://review.gluster.org/#/c/13872/).
>> >
>> > Yup. We know this is happening, but don't know why yet. I'll keep this
>> > thread updated with any findings I have.
>> >
>> >> Thanks,
>> >> Saravana
>> >>
>> >>
>> >> On 06/09/2016 11:52 AM, Kaushal M wrote:
>> >>>
>> >>> In addition to the builder issues we're having, we are also facing
>> >>> problems with Jenkins voting/commenting randomly.
>> >>>
>> >>> The comments generally link to older jobs for older patchsets, which
>> >>> were run about 2 months back (beginning of April). For example,
>> >>> https://review.gluster.org/14665 has a NetBSD regression +1 vote from
>> >>> a job run in April for review 13873, which actually failed.
>> >>>
>> >>> Another observation I've made is that these fake votes sometimes
>> >>> provide a -1 Verified. Jenkins shouldn't be using this flag anymore.
>> >>>
>> >>> These 2 observations make me wonder if another Jenkins instance is
>> >>> running somewhere, possibly from our old backups. Michael, could this
>> >>> be possible?
>> >>>
>> >>> To check where these votes/comments were coming from, I tried
>> >>> checking the Gerrit sshd logs. This wasn't helpful, because all
>> >>> logins apparently happen from 127.0.0.1. This is probably some
>> >>> firewall rule that was set up post migration, because I see older
>> >>> logs giving proper IPs. I'll need Michael's help with fixing this,
>> >>> if possible.
>> >>>
>> >>> I'll continue to investigate, and update this thread with anything I
>> >>> find.
>> >>>
>>
>> My guess was right!!
>>
>> This problem should now be fixed, as well as the problem with the
>> builders. The cause for both is the same: our old Jenkins server, back
>> from the dead (zombie-jenkins from now on).
>>
>> The hypervisor at iWeb which hosted our services earlier, and which was
>> supposed to be off, started up about 4 days back. This brought back
>> zombie-jenkins.
>>
>> Zombie-jenkins continued from where it left off in early April. It
>> started getting Gerrit events, and started running jobs for them.
>> Zombie-jenkins numbered these jobs from where it had left off, and used
>> those numbers when reporting back to Gerrit. But those job numbers had
>> already been used by new-jenkins about 2 months back when it started.
>> This is why the links in the comments pointed to the old jobs in
>> new-jenkins. I've checked the logs on Gerrit (with help from Michael)
>> and can verify that these comments/votes did come from zombie-jenkins's
>> IP.
>>
>> Zombie-jenkins also explains the random build failures being seen on
>> the builders. Zombie-jenkins and new-jenkins each thought they had the
>> slaves to themselves and launched jobs on them, causing jobs to clash
>> sometimes, which resulted in the random failures reported in
>> new-jenkins. I'm yet to log in to a slave and verify this, but I'm
>> pretty sure this is what happened.
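(If anyone wants to double-check one of these suspicious votes themselves,
the rough Python sketch below is one way to do it. It is only an
illustration against the standard Jenkins JSON API on build.gluster.org,
not something we have run; the job name is the one from the log URL above,
and the build number is just a placeholder for whatever number the fake
vote links to. If that build turns out to have started in early April
while the Gerrit comment was posted in June, the vote can't have come from
new-jenkins.)

import json
import urllib.request
from datetime import datetime, timezone

JENKINS = "https://build.gluster.org"
JOB = "rackspace-netbsd7-regression-triggered"
BUILD = 15624  # placeholder: use the build number linked from the suspicious vote

# The standard Jenkins JSON API exposes per-build metadata at this URL.
url = "{}/job/{}/{}/api/json".format(JENKINS, JOB, BUILD)
with urllib.request.urlopen(url) as resp:
    info = json.loads(resp.read().decode("utf-8"))

# Jenkins reports 'timestamp' in milliseconds since the epoch.
started = datetime.fromtimestamp(info["timestamp"] / 1000.0, tz=timezone.utc)
print("Build #{} of {} started {} (result: {})".format(
    BUILD, JOB, started.isoformat(), info.get("result")))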
>>
>> For now, Michael has stopped the iWeb hypervisor and zombie-jenkins.
>> This should stop any more random comments in Gerrit and failures in
>> Jenkins.
>
> Well, I just stopped the 3 VMs, and disabled them on boot (both xen and
> libvirt), so they shouldn't cause much trouble.
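(For the record, and only as a guess at what that amounts to: with the
libvirt Python bindings it would look roughly like the sketch below. The
xen:///system URI and the graceful shutdown are my assumptions, not what
Michael actually ran; the point is simply that the domains get marked
non-autostart so a hypervisor reboot can't bring them back.)

import libvirt

# Assumed connection URI for the old Xen hypervisor; adjust for the real host.
conn = libvirt.open("xen:///system")

for dom in conn.listAllDomains():
    print("Domain {}: active={}, autostart={}".format(
        dom.name(), dom.isActive(), dom.autostart()))
    dom.setAutostart(0)      # never start this VM again when the host boots
    if dom.isActive():
        dom.shutdown()       # ask the guest to shut down cleanly

conn.close()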
I hope something better than fire was used this time; it wasn't effective
last time.

>
>> I'll get Michael (once he's back on Monday) to figure out why
>> zombie-jenkins restarted, and write up a proper postmortem about the
>> issues.
>
> Oh, that part is easy to guess. We did ask iWeb to stop the server;
> that was supposed to happen around the end of May (I need to dig up the
> mail) and I guess they did. The logs stop at 29 May.
>
> Then someone saw it was down and restarted it around the 4th of June at
> 9h25. However, the server did not seem to have ntp running, so the time
> was off by 4h, and I am not sure if someone started it at 9h25 EDT or
> 5h25 EDT. As the server is in Montreal, I would assume 9h25 is a
> reasonable time, but then, losing 4h in 4 days is a bit concerning.
> (At the same time, someone working at 5h in the morning would explain
> why the wrong server got restarted; I am also not that fresh at that
> time usually.)
>
> Then, as the VMs were configured to start on boot, they all came back
> ~4 days ago.
>
> I guess digging more requires us to contact iWeb, which can be done (we
> have an ex-iWeb person on the RDO project who still has good insider
> contacts).

This should be enough for writing up the postmortem. I'm now trying to
get proper evidence of zombie-jenkins causing the build failures. I'll
write up the postmortem after that.

>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-infra