Re: [Gluster-infra] Putting the netbsd builders in the ansible pool ?
Michael Scherer wrote:
> The only issue I face is that you flagged most of /usr as unchangeable,
> and I do not know how clean it would be to remove the flags before
> applying changes and set them again afterwards, with the current layout
> of our ansible roles. But I will figure something out.

I did this because of a glusterfs bug that overwrote random files with logs. I tend to overwrite a protected file this way:

cat hosts | ssh root@host "chflags nouchg /etc/hosts; cat > /etc/hosts; chflags uchg /etc/hosts"

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
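For the automation discussion, a hedged sketch of the same flag dance wrapped in a small script could look like this; the host, source file and target path are placeholder arguments, not anything taken from the actual ansible roles:

    #!/bin/sh
    # Sketch only: temporarily drop the immutable flag, replace the file, re-lock it.
    set -eu
    host="$1"    # e.g. a builder hostname (example)
    src="$2"     # local file to push, e.g. ./hosts
    dst="$3"     # remote path, e.g. /etc/hosts
    ssh "root@$host" "chflags nouchg $dst"
    scp "$src" "root@$host:$dst"
    ssh "root@$host" "chflags uchg $dst"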
Re: [Gluster-infra] All NetBSD slaves look offline
On Thursday, 09 June 2016 at 21:32 +0530, Atin Mukherjee wrote:
> I can see the recent jobs have failed with the following error:
>
> 08:51:29 Caused by: hudson.plugins.git.GitException: Command "git config
> remote.origin.url git://review.gluster.org/glusterfs.git" returned
> status code 4:
> 08:51:29 stdout:
> 08:51:29 stderr: error: failed to write new configuration file
> .git/config.lock

So not all of them seem to be down, quite the contrary. They can be marked as offline in the interface because Jenkins puts them offline when idle, but they come back when needed. Now, maybe they were all offline in the past (this doesn't appear in the interface) and someone brought them back.

nbslave75.cloud.gluster.org and nbslave71.cloud.gluster.org have full disks, hence the error. nb71 is already offline. And for a reason I do not understand, nb75 was the one where everything was scheduled. Either someone took care of bringing the others online, or there is a bug. I am running a test to see, but to me it looks correct for now:
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/17450/console

So without more info, I can't do much.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
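As an aside, confirming the full-disk diagnosis on a slave is usually a couple of commands; a hedged sketch, where /build is an assumed Jenkins workspace root rather than the verified layout of these builders:

    df -h                                   # find which filesystem is full
    du -skx /build/* 2>/dev/null | sort -n  # /build is an assumed workspace root; largest entries last
    # once confirmed, stale job workspaces can be removed to free space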
Re: [Gluster-infra] All NetBSD slaves look offline
I can see the recent jobs have failed with the following error:

08:51:29 Caused by: hudson.plugins.git.GitException: Command "git config
remote.origin.url git://review.gluster.org/glusterfs.git" returned
status code 4:
08:51:29 stdout:
08:51:29 stderr: error: failed to write new configuration file
.git/config.lock

On 06/09/2016 09:31 PM, Atin Mukherjee wrote:
[Gluster-infra] All NetBSD slaves look offline
Re: [Gluster-infra] Investigating random votes in Gerrit
On Thu, Jun 9, 2016 at 3:01 PM, Michael Scherer wrote:
> On Thursday, 09 June 2016 at 14:32 +0530, Kaushal M wrote:
>> On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M wrote:
>> > On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam wrote:
>> >> Hi Kaushal,
>> >>
>> >> One of the patches (http://review.gluster.org/#/c/14653/) is
>> >> failing in NetBSD.
>> >> Its log:
>> >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
>> >>
>> >> But the patch mentioned in the NetBSD job is another one
>> >> (http://review.gluster.org/#/c/13872/).
>> >>
>> >
>> > Yup. We know this is happening, but don't know why yet. I'll keep this
>> > thread updated with any findings I have.
>> >
>> >> Thanks,
>> >> Saravana
>> >>
>> >> On 06/09/2016 11:52 AM, Kaushal M wrote:
>> >>>
>> >>> In addition to the builder issues we're having, we are also facing
>> >>> problems with jenkins voting/commenting randomly.
>> >>>
>> >>> The comments generally link to older jobs for older patchsets, which
>> >>> were run about 2 months back (beginning of April). For example,
>> >>> https://review.gluster.org/14665 has a netbsd regression +1 vote, from
>> >>> a job run in April for review 13873, which actually failed.
>> >>>
>> >>> Another observation that I've made is that these fake votes sometimes
>> >>> provide a -1 Verified. Jenkins shouldn't be using this flag anymore.
>> >>>
>> >>> These 2 observations make me wonder if another jenkins instance is
>> >>> running somewhere, from our old backups possibly? Michael, could this
>> >>> be possible?
>> >>>
>> >>> To check where these votes/comments were coming from, I tried
>> >>> checking the Gerrit sshd logs. This wasn't helpful, because all logins
>> >>> apparently happen from 127.0.0.1. This is probably some firewall rule
>> >>> that was set up post migration, because I see older logs giving
>> >>> proper IPs. I'll require Michael's help with fixing this, if possible.
>> >>>
>> >>> I'll continue to investigate, and update this thread with anything I
>> >>> find.
>> >>>
>> My guess was right!!
>>
>> This problem should now be fixed, as well as the problem with the builders.
>> The cause for both is the same: our old jenkins server, back from the
>> dead (zombie-jenkins from now on).
>>
>> The hypervisor in iWeb which hosted our services earlier, and which was
>> supposed to be off, started up again about 4 days back. This brought
>> back zombie-jenkins.
>>
>> Zombie-jenkins continued from where it left off around early April. It
>> started getting gerrit events, and started running jobs for them.
>> Zombie-jenkins numbered jobs from where it left off, and used these
>> numbers when reporting back to gerrit. But these job numbers had
>> already been used by new-jenkins about 2 months back when it started.
>> This is why the links in the comments pointed to the old jobs in new-jenkins.
>> I've checked logs on Gerrit (with help from Michael) and can verify
>> that these comments/votes did come from zombie-jenkins's IP.
>>
>> Zombie-jenkins also explains the random build failures being seen on
>> the builders. Zombie-jenkins and new-jenkins each thought they had the
>> slaves to themselves and launched jobs on them, causing jobs to clash
>> sometimes, which resulted in the random failures reported in new-jenkins.
>> I'm yet to log in to a slave and verify this, but I'm pretty sure this
>> is what happened.
>>
>> For now, Michael has stopped the iWeb hypervisor and zombie-jenkins.
>> This should stop any more random comments in Gerrit and failures in Jenkins.
>
> Well, I just stopped the 3 VMs and disabled them on boot (both Xen and
> libvirt), so they shouldn't cause much trouble.

I hope something better than fire was used this time; it wasn't effective last time.

>
>> I'll get Michael (once he's back on Monday) to figure out why
>> zombie-jenkins restarted, and write up a proper postmortem about the
>> issues.
>
> Oh, that part is easy to guess. We did ask iWeb to stop the server;
> that was supposed to happen around the end of May (I need to dig through
> my mail) and I guess they did. Logs stop at 29 May.
>
> Then someone saw it was down and restarted it around the 4th of June
> at 9h25. However, the server did not seem to have ntp running, so the
> time was off by 4h, and I am not sure whether it was started at 9h25 EDT
> or 5h25 EDT. As the server is in Montreal, I would assume 9h25 is a
> reasonable time, but then, losing 4h in 4 days is a bit concerning.
> (At the same time, someone working at 5h in the morning would explain
> why the wrong server got restarted; I am also not that fresh at that
> time usually.)
>
> Then, as the VMs were configured to start on boot, they all came back
> ~4 days ago.
>
> I guess digging more requires us to contact iWeb, which can be done (we
> have one ex-iWeb person on the RDO project who still has good insider
> contacts).

This should be enough for writing up the postmortem. I'm now trying to get proper evidence of
Re: [Gluster-infra] Investigating random votes in Gerrit
On Thursday, 09 June 2016 at 14:32 +0530, Kaushal M wrote:
> On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M wrote:
> > On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam wrote:
> >> Hi Kaushal,
> >>
> >> One of the patches (http://review.gluster.org/#/c/14653/) is
> >> failing in NetBSD.
> >> Its log:
> >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
> >>
> >> But the patch mentioned in the NetBSD job is another one
> >> (http://review.gluster.org/#/c/13872/).
> >>
> >
> > Yup. We know this is happening, but don't know why yet. I'll keep this
> > thread updated with any findings I have.
> >
> >> Thanks,
> >> Saravana
> >>
> >> On 06/09/2016 11:52 AM, Kaushal M wrote:
> >>>
> >>> In addition to the builder issues we're having, we are also facing
> >>> problems with jenkins voting/commenting randomly.
> >>>
> >>> The comments generally link to older jobs for older patchsets, which
> >>> were run about 2 months back (beginning of April). For example,
> >>> https://review.gluster.org/14665 has a netbsd regression +1 vote, from
> >>> a job run in April for review 13873, which actually failed.
> >>>
> >>> Another observation that I've made is that these fake votes sometimes
> >>> provide a -1 Verified. Jenkins shouldn't be using this flag anymore.
> >>>
> >>> These 2 observations make me wonder if another jenkins instance is
> >>> running somewhere, from our old backups possibly? Michael, could this
> >>> be possible?
> >>>
> >>> To check where these votes/comments were coming from, I tried
> >>> checking the Gerrit sshd logs. This wasn't helpful, because all logins
> >>> apparently happen from 127.0.0.1. This is probably some firewall rule
> >>> that was set up post migration, because I see older logs giving
> >>> proper IPs. I'll require Michael's help with fixing this, if possible.
> >>>
> >>> I'll continue to investigate, and update this thread with anything I
> >>> find.
> >>>
> My guess was right!!
>
> This problem should now be fixed, as well as the problem with the builders.
> The cause for both is the same: our old jenkins server, back from the
> dead (zombie-jenkins from now on).
>
> The hypervisor in iWeb which hosted our services earlier, and which was
> supposed to be off, started up again about 4 days back. This brought
> back zombie-jenkins.
>
> Zombie-jenkins continued from where it left off around early April. It
> started getting gerrit events, and started running jobs for them.
> Zombie-jenkins numbered jobs from where it left off, and used these
> numbers when reporting back to gerrit. But these job numbers had
> already been used by new-jenkins about 2 months back when it started.
> This is why the links in the comments pointed to the old jobs in new-jenkins.
> I've checked logs on Gerrit (with help from Michael) and can verify
> that these comments/votes did come from zombie-jenkins's IP.
>
> Zombie-jenkins also explains the random build failures being seen on
> the builders. Zombie-jenkins and new-jenkins each thought they had the
> slaves to themselves and launched jobs on them, causing jobs to clash
> sometimes, which resulted in the random failures reported in new-jenkins.
> I'm yet to log in to a slave and verify this, but I'm pretty sure this
> is what happened.
>
> For now, Michael has stopped the iWeb hypervisor and zombie-jenkins.
> This should stop any more random comments in Gerrit and failures in Jenkins.

Well, I just stopped the 3 VMs and disabled them on boot (both Xen and libvirt), so they shouldn't cause much trouble.

> I'll get Michael (once he's back on Monday) to figure out why
> zombie-jenkins restarted, and write up a proper postmortem about the
> issues.

Oh, that part is easy to guess. We did ask iWeb to stop the server; that was supposed to happen around the end of May (I need to dig through my mail) and I guess they did. Logs stop at 29 May.

Then someone saw it was down and restarted it around the 4th of June at 9h25. However, the server did not seem to have ntp running, so the time was off by 4h, and I am not sure whether it was started at 9h25 EDT or 5h25 EDT. As the server is in Montreal, I would assume 9h25 is a reasonable time, but then, losing 4h in 4 days is a bit concerning. (At the same time, someone working at 5h in the morning would explain why the wrong server got restarted; I am also not that fresh at that time usually.)

Then, as the VMs were configured to start on boot, they all came back ~4 days ago.

I guess digging more requires us to contact iWeb, which can be done (we have one ex-iWeb person on the RDO project who still has good insider contacts).

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
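For the record, disabling autostart for a guest usually comes down to something like the following hedged sketch; the domain name is a placeholder, and the exact steps depend on whether the guests are managed through libvirt or plain Xen:

    # libvirt-managed guest (domain name is an example)
    virsh autostart --disable old-jenkins
    virsh shutdown old-jenkins
    # plain Xen: the traditional autostart mechanism is a config (or symlink)
    # under /etc/xen/auto; removing it stops the guest from starting on boot
    rm /etc/xen/auto/old-jenkins.cfg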
Re: [Gluster-infra] Regression fails due to infra issue
On Wed, Jun 8, 2016 at 4:50 PM, Niels de Vos wrote:
> On Wed, Jun 08, 2016 at 10:30:37AM +0200, Michael Scherer wrote:
>> On Wednesday, 08 June 2016 at 03:15 +0200, Niels de Vos wrote:
>> > On Tue, Jun 07, 2016 at 10:29:34AM +0200, Michael Scherer wrote:
>> > > On Tuesday, 07 June 2016 at 10:00 +0200, Michael Scherer wrote:
>> > > > On Tuesday, 07 June 2016 at 09:54 +0200, Michael Scherer wrote:
>> > > > > On Monday, 06 June 2016 at 21:18 +0200, Niels de Vos wrote:
>> > > > > > On Mon, Jun 06, 2016 at 09:59:02PM +0530, Nigel Babu wrote:
>> > > > > > > On Mon, Jun 6, 2016 at 12:56 PM, Poornima Gurusiddaiah wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > There are multiple issues that we saw with regressions lately:
>> > > > > > > >
>> > > > > > > > 1. On certain slaves the regression fails during the build; I
>> > > > > > > > see this on slave26.cloud.gluster.org, slave25.cloud.gluster.org
>> > > > > > > > and maybe others as well.
>> > > > > > > > Eg:
>> > > > > > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/21422/console
>> > > > > > > >
>> > > > > > >
>> > > > > > > Are you sure this isn't a code breakage?
>> > > > > >
>> > > > > > No, it really does not look like that.
>> > > > > >
>> > > > > > This is another one; it seems the testcase got killed for some
>> > > > > > reason:
>> > > > > >
>> > > > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/21459/console
>> > > > > >
>> > > > > > It was running on slave25.cloud.gluster.org too... Is it possible
>> > > > > > that there is some watchdog or other configuration checking for
>> > > > > > resources and killing testcases on occasion? The number of slaves
>> > > > > > where this happens seems limited; were these more recently
>> > > > > > installed/configured?
>> > > > >
>> > > > > So dmesg speaks of a segfault in yum:
>> > > > >
>> > > > > yum[2711] trap invalid opcode ip:7f2efac38d60 sp:7ffd77322658
>> > > > > error:0 in libfreeblpriv3.so[7f2efabe6000+72000]
>> > > > >
>> > > > > and
>> > > > > https://access.redhat.com/solutions/2313911
>> > > > >
>> > > > > That's exactly the problem.
>> > > > > [root@slave25 ~]# /usr/bin/curl https://google.com
>> > > > > Illegal instruction
>> > > > >
>> > > > > I propose to remove the builder from rotation while we investigate.
>> > > >
>> > > > Or we can:
>> > > >
>> > > > export NSS_DISABLE_HW_AES=1
>> > > >
>> > > > to work around it, cf. the bug listed in the article.
>> > > >
>> > > > Not sure of the best way to deploy that.
>> > >
>> > > So we are testing the fix on slave25, and if that is what fixes the
>> > > error, I will deploy it to all the gluster builders, and investigate
>> > > for the non-builder servers. That's only for RHEL 6/CentOS 6 on
>> > > Rackspace.
>> >
>> > If this does not work, configuring mock to use http (without the 's')
>> > might be an option too. The export variable would probably need to get
>> > set inside the mock chroot. It can possibly be done in
>> > /etc/mock/site-defaults.cfg.
>> >
>> > For the normal test cases, placing the environment variable (and maybe
>> > NSS_DISABLE_HW_GCM=1 too?) in the global bashrc might be sufficient.
>>
>> We used /etc/environment, and so far, no one has complained about side
>> effects.
>>
>> (I mean, this did fix stuff, right? Right???)
>
> I don't know. This was the last job that failed due to the bug:
>
> https://build.gluster.org/job/glusterfs-devrpms/16978/console
>
> There are more recent ones on slave25 that failed for unclear reasons
> as well; not sure if that is caused by the same problem:
>
> https://build.gluster.org/computer/slave25.cloud.gluster.org/builds
>
> Thanks,
> Niels

The random build failures should now be fixed (or at least should not happen anymore). Please refer to the mail thread 'Investigating random votes in Gerrit' for more information.

~kaushal
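For reference, the workaround discussed above boils down to exporting the variable system-wide and, if needed, inside the mock chroot. A hedged sketch follows; /etc/environment is the file mentioned in the thread, while the mock 'environment' option is an assumption to verify against the mock documentation:

    # system-wide, via /etc/environment (read by pam_env at login)
    echo 'NSS_DISABLE_HW_AES=1' >> /etc/environment
    # inside the mock chroot, via site-defaults.cfg; whether this exact option
    # is supported by the installed mock version should be double-checked
    echo "config_opts['environment']['NSS_DISABLE_HW_AES'] = '1'" >> /etc/mock/site-defaults.cfg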
Re: [Gluster-infra] Investigating random votes in Gerrit
On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M wrote:
> On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam wrote:
>> Hi Kaushal,
>>
>> One of the patches (http://review.gluster.org/#/c/14653/) is
>> failing in NetBSD.
>> Its log:
>> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
>>
>> But the patch mentioned in the NetBSD job is another one
>> (http://review.gluster.org/#/c/13872/).
>>
>
> Yup. We know this is happening, but don't know why yet. I'll keep this
> thread updated with any findings I have.
>
>> Thanks,
>> Saravana
>>
>> On 06/09/2016 11:52 AM, Kaushal M wrote:
>>>
>>> In addition to the builder issues we're having, we are also facing
>>> problems with jenkins voting/commenting randomly.
>>>
>>> The comments generally link to older jobs for older patchsets, which
>>> were run about 2 months back (beginning of April). For example,
>>> https://review.gluster.org/14665 has a netbsd regression +1 vote, from
>>> a job run in April for review 13873, which actually failed.
>>>
>>> Another observation that I've made is that these fake votes sometimes
>>> provide a -1 Verified. Jenkins shouldn't be using this flag anymore.
>>>
>>> These 2 observations make me wonder if another jenkins instance is
>>> running somewhere, from our old backups possibly? Michael, could this
>>> be possible?
>>>
>>> To check where these votes/comments were coming from, I tried
>>> checking the Gerrit sshd logs. This wasn't helpful, because all logins
>>> apparently happen from 127.0.0.1. This is probably some firewall rule
>>> that was set up post migration, because I see older logs giving
>>> proper IPs. I'll require Michael's help with fixing this, if possible.
>>>
>>> I'll continue to investigate, and update this thread with anything I
>>> find.
>>>

My guess was right!!

This problem should now be fixed, as well as the problem with the builders. The cause for both is the same: our old jenkins server, back from the dead (zombie-jenkins from now on).

The hypervisor in iWeb which hosted our services earlier, and which was supposed to be off, started up again about 4 days back. This brought back zombie-jenkins.

Zombie-jenkins continued from where it left off around early April. It started getting gerrit events, and started running jobs for them. Zombie-jenkins numbered jobs from where it left off, and used these numbers when reporting back to gerrit. But these job numbers had already been used by new-jenkins about 2 months back when it started. This is why the links in the comments pointed to the old jobs in new-jenkins. I've checked logs on Gerrit (with help from Michael) and can verify that these comments/votes did come from zombie-jenkins's IP.

Zombie-jenkins also explains the random build failures being seen on the builders. Zombie-jenkins and new-jenkins each thought they had the slaves to themselves and launched jobs on them, causing jobs to clash sometimes, which resulted in the random failures reported in new-jenkins. I'm yet to log in to a slave and verify this, but I'm pretty sure this is what happened.

For now, Michael has stopped the iWeb hypervisor and zombie-jenkins. This should stop any more random comments in Gerrit and failures in Jenkins.

I'll get Michael (once he's back on Monday) to figure out why zombie-jenkins restarted, and write up a proper postmortem about the issues.

>>> ~kaushal
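As a side note on the sshd-log check mentioned above, a hedged one-liner for counting login sources could look roughly like this; GERRIT_SITE stands in for the site directory, and the exact log format varies between Gerrit versions:

    # GERRIT_SITE is the Gerrit site directory (placeholder); 'LOGIN FROM'
    # lines in sshd_log record the client address in most Gerrit versions
    grep 'LOGIN FROM' "$GERRIT_SITE/logs/sshd_log" \
      | sed 's/.*LOGIN FROM //' \
      | sort | uniq -c | sort -rn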
Re: [Gluster-infra] Putting the netbsd builders in the ansible pool ?
On Thursday, 09 June 2016 at 02:09 +0200, Emmanuel Dreyfus wrote:
> Michael Scherer wrote:
>
> > I connected to it from rackspace and stopped rpcbind in a hurry after
> > being paged, but I would like to make sure that the netbsd builders are
> > a bit more hardened (even if they are already well hardened from what I
> > did see, even though there is no firewall), as it seems most of them are
> > also running rpcbind (and sockstat shows they are not listening only on
> > localhost).
>
> I created minimal filtering rules in /etc/ipf.conf and restarted
> rpcbind. I did the same for the other NetBSD VMs.

OK, great. I did the same for the FreeBSD builder.

> > Emmanuel, would you be OK if we start to manage them with ansible like
> > we do for the CentOS ones?
>
> I have no problem with it, but I must confess a complete lack of
> experience with this tool.

It mostly deploys scripts over ssh. The only issue I face is that you flagged most of /usr as unchangeable, and I do not know how clean it would be to remove the flags before applying changes and set them again afterwards, with the current layout of our ansible roles. But I will figure something out.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
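For readers not familiar with NetBSD's IPFilter, a minimal /etc/ipf.conf along the lines Emmanuel describes might look roughly like the sketch below; the xennet0 interface name and the ssh-only policy are illustrative assumptions, not the rules actually deployed on the builders:

    # /etc/ipf.conf: hedged sketch only; adjust interface and allowed ports
    pass in quick on lo0 all
    pass out quick on lo0 all
    pass out quick on xennet0 all keep state
    pass in quick on xennet0 proto tcp from any to any port = 22 keep state
    block in quick on xennet0 all

The rules could then be reloaded with something like "ipf -Fa -f /etc/ipf.conf".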