Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-25 Thread Emmanuel Dreyfus
On Sun, Apr 24, 2016 at 03:59:40PM +0200, Niels de Vos wrote:
> Well, slaves do go offline, and should be woken up when needed.
> However it seems that Jenkins fails to connect to many slaves :-/

Nothing new here. I tracked this kind of trouble with the NetBSD slaves
and only got frustration as a result.

-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-25 Thread Niels de Vos
On Mon, Apr 25, 2016 at 11:58:56AM +0200, Michael Scherer wrote:
> On Mon, Apr 25, 2016 at 11:26 +0200, Michael Scherer wrote:
> > On Mon, Apr 25, 2016 at 11:12 +0200, Niels de Vos wrote:
> > > On Mon, Apr 25, 2016 at 10:43:13AM +0200, Michael Scherer wrote:
> > > > On Sun, Apr 24, 2016 at 15:59 +0200, Niels de Vos wrote:
> > > > > On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
> > > > > > On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  
> > > > > > wrote:
> > > > > > > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever 
> > > > > > >  wrote:
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> Noticed our regression machines are reporting back really slowly,
> > > > > > >> especially CentOS and Smoke.
> > > > > > >>
> > > > > > >> I found that most of the slaves are marked offline; could this be
> > > > > > >> the biggest reason?
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > > > Regression machines are scheduled to be offline if there are no 
> > > > > > > active
> > > > > > > jobs. I wonder if the slowness is related to LVM or related 
> > > > > > > factors as
> > > > > > > detailed in a recent thread?
> > > > > > >
> > > > > > 
> > > > > > Sorry, the previous mail was sent incomplete (blame some Gmail 
> > > > > > shortcut)
> > > > > > 
> > > > > > Hi Vijay,
> > > > > > 
> > > > > > Honestly, I was not aware that the machines move to the offline
> > > > > > state by themselves; I only knew that they go idle.
> > > > > > Thanks for sharing that information. But we still need to reclaim
> > > > > > most of the machines. Here are the reasons why each of them is
> > > > > > offline.
> > > > > 
> > > > > Well, slaves do go offline, and should be woken up when needed.
> > > > > However it seems that Jenkins fails to connect to many slaves :-/
> > > > > 
> > > > > I've rebooted:
> > > > > 
> > > > >  - slave46
> > > > >  - slave28
> > > > >  - slave26
> > > > >  - slave25
> > > > >  - slave24
> > > > >  - slave23
> > > > >  - slave21
> > > > > 
> > > > > These all seem to have come up correctly after clicking the 'Launch
> > > > > slave agent' button on the slave's status page.
> > > > > 
> > > > > Remember that anyone with a Jenkins account can reboot VMs. This most
> > > > > often is sufficient to get them working again. Just go to
> > > > > https://build.gluster.org/job/reboot-vm/ , login and press some 
> > > > > buttons.
> > > > > 
> > > > > One slave is in a weird status, maybe one of the tests overwrote the 
> > > > > ssh
> > > > > key?
> > > > > 
> > > > > [04/24/16 06:48:02] [SSH] Opening SSH connection to 
> > > > > slave29.cloud.gluster.org:22.
> > > > > ERROR: Failed to authenticate as jenkins. Wrong password. 
> > > > > (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
> > > > > [04/24/16 06:48:04] [SSH] Authentication failed.
> > > > > hudson.AbortException: Authentication failed.
> > > > >   at 
> > > > > hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
> > > > >   at 
> > > > > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
> > > > >   at 
> > > > > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
> > > > >   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > > > >   at 
> > > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > > >   at 
> > > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > > >   at java.lang.Thread.run(Thread.java:745)
> > > > > [04/24/16 06:48:04] Launch failed - cleaning up connection
> > > > > [04/24/16 06:48:05] [SSH] Connection closed.
> > > > > 
> > > > > Leaving slave29 as is, maybe one of our admins can have a look and see
> > > > > if it needs reprovisioning.
> > > > 
> > > > It seems slave29 was reinstalled and/or slightly damaged; it was no
> > > > longer in the salt configuration, but I could connect as root.
> > > > 
> > > > It should work better now, but please tell me if anything is incorrect
> > > > with it.
> > > 
> > > Hmm, not really. Launching the Jenkins slave agent on it through the
> > > web UI still fails in the same way:
> > > 
> > >   https://build.gluster.org/computer/slave29.cloud.gluster.org/log
> > > 
> > > Maybe the "jenkins" user on the slave has the wrong password?
> > 
> > So, it seems it first had the wrong host key, so I changed that.
> > 
> > I am looking at what is wrong, so do not put it offline :)
> 
> So the script to update the /etc/hosts file was not run, so the slave
> was using the wrong IP.
> 
> Can we agree on getting rid of it now, since there is no need for it
> anymore?

I guess so, DNS should be stable now, right?
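
A quick way to answer that, before the pinned entries go away, is to
compare what DNS returns with whatever /etc/hosts still pins. Below is
only a rough sketch: it assumes the external dnspython package is
installed, and the slave list is just a couple of example names from
this thread.

    # check_dns.py - compare DNS answers with /etc/hosts pins (sketch)
    import dns.resolver  # pip install dnspython; queries DNS directly

    SLAVES = ["slave29.cloud.gluster.org", "slave46.cloud.gluster.org"]

    def pinned_ip(name, path="/etc/hosts"):
        # return the IP that /etc/hosts pins for `name`, or None
        with open(path) as hosts:
            for line in hosts:
                fields = line.split("#")[0].split()
                if len(fields) >= 2 and name in fields[1:]:
                    return fields[0]
        return None

    for name in SLAVES:
        answer = dns.resolver.query(name, "A")[0].address
        pin = pinned_ip(name)
        if pin and pin != answer:
            print("%s: /etc/hosts pins %s, DNS says %s" % (name, pin, answer))
        else:
            print("%s: consistent (%s)" % (name, answer))

If the two never disagree over a few days, dropping the script and the
pinned entries should be safe.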

> (then I will also remove the /etc/rax-reboot file from the various
> slaves, and maybe replace it with an Ansible-based system)

rax-reboot is only 

Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-25 Thread Michael Scherer
On Mon, Apr 25, 2016 at 11:26 +0200, Michael Scherer wrote:
> On Mon, Apr 25, 2016 at 11:12 +0200, Niels de Vos wrote:
> > On Mon, Apr 25, 2016 at 10:43:13AM +0200, Michael Scherer wrote:
> > > On Sun, Apr 24, 2016 at 15:59 +0200, Niels de Vos wrote:
> > > > On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
> > > > > On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  
> > > > > wrote:
> > > > > > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever 
> > > > > >  wrote:
> > > > > >> Hi all,
> > > > > >>
> > > > > >> Noticed our regression machines are reporting back really slowly,
> > > > > >> especially CentOS and Smoke.
> > > > > >>
> > > > > >> I found that most of the slaves are marked offline; could this be
> > > > > >> the biggest reason?
> > > > > >>
> > > > > >>
> > > > > >
> > > > > > Regression machines are scheduled to be offline if there are no 
> > > > > > active
> > > > > > jobs. I wonder if the slowness is related to LVM or related factors 
> > > > > > as
> > > > > > detailed in a recent thread?
> > > > > >
> > > > > 
> > > > > Sorry, the previous mail was sent incomplete (blame some Gmail 
> > > > > shortcut)
> > > > > 
> > > > > Hi Vijay,
> > > > > 
> > > > > Honestly, I was not aware that the machines move to the offline
> > > > > state by themselves; I only knew that they go idle.
> > > > > Thanks for sharing that information. But we still need to reclaim
> > > > > most of the machines. Here are the reasons why each of them is
> > > > > offline.
> > > > 
> > > > Well, slaves do go offline, and should be woken up when needed.
> > > > However it seems that Jenkins fails to connect to many slaves :-/
> > > > 
> > > > I've rebooted:
> > > > 
> > > >  - slave46
> > > >  - slave28
> > > >  - slave26
> > > >  - slave25
> > > >  - slave24
> > > >  - slave23
> > > >  - slave21
> > > > 
> > > > These all seem to have come up correctly after clicking the 'Launch
> > > > slave agent' button on the slave's status page.
> > > > 
> > > > Remember that anyone with a Jenkins account can reboot VMs. This most
> > > > often is sufficient to get them working again. Just go to
> > > > https://build.gluster.org/job/reboot-vm/ , login and press some buttons.
> > > > 
> > > > One slave is in a weird status, maybe one of the tests overwrote the ssh
> > > > key?
> > > > 
> > > > [04/24/16 06:48:02] [SSH] Opening SSH connection to 
> > > > slave29.cloud.gluster.org:22.
> > > > ERROR: Failed to authenticate as jenkins. Wrong password. 
> > > > (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
> > > > [04/24/16 06:48:04] [SSH] Authentication failed.
> > > > hudson.AbortException: Authentication failed.
> > > > at 
> > > > hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
> > > > at 
> > > > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
> > > > at 
> > > > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
> > > > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > > > at 
> > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > > at 
> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > > at java.lang.Thread.run(Thread.java:745)
> > > > [04/24/16 06:48:04] Launch failed - cleaning up connection
> > > > [04/24/16 06:48:05] [SSH] Connection closed.
> > > > 
> > > > Leaving slave29 as is, maybe one of our admins can have a look and see
> > > > if it needs reprovisioning.
> > > 
> > > It seems slave29 was reinstalled and/or slightly damaged; it was no
> > > longer in the salt configuration, but I could connect as root.
> > > 
> > > It should work better now, but please tell me if anything is incorrect
> > > with it.
> > 
> > Hmm, not really. Launching the Jenkins slave agent on it through the
> > web UI still fails in the same way:
> > 
> >   https://build.gluster.org/computer/slave29.cloud.gluster.org/log
> > 
> > Maybe the "jenkins" user on the slave has the wrong password?
> 
> So, it seems it first had the wrong host key, so I changed that.
> 
> I am looking at what is wrong, so do not put it offline :)

So the script to update the /etc/hosts file was not run, so the slave
was using the wrong IP.

Can we agree on getting rid of it now, since there is no need for it
anymore?

(then I will also remove the /etc/rax-reboot file from the various
slaves, and maybe replace it with an Ansible-based system)
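
(For reference: the actual script never appears in this thread, so the
following is only a guess at the kind of rewrite it performed, namely
pinning selected names to fixed addresses in /etc/hosts; the
hostname/IP pair is a placeholder. An Ansible-based replacement would
presumably manage the same entries with something like the lineinfile
module.)

    # rewrite_hosts.py - sketch of a hosts-file pinning script
    PINNED = {"build.gluster.org": "203.0.113.10"}  # example IP (RFC 5737)

    def rewrite_hosts(path="/etc/hosts"):
        # drop any line that already mentions a pinned name ...
        with open(path) as hosts:
            kept = [line for line in hosts
                    if not any(name in line.split() for name in PINNED)]
        # ... then append one fresh entry per pinned name
        kept += ["%s %s\n" % (ip, name) for name, ip in PINNED.items()]
        with open(path, "w") as hosts:
            hosts.writelines(kept)

    if __name__ == "__main__":
        rewrite_hosts()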
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-25 Thread Michael Scherer
On Mon, Apr 25, 2016 at 11:12 +0200, Niels de Vos wrote:
> On Mon, Apr 25, 2016 at 10:43:13AM +0200, Michael Scherer wrote:
> > On Sun, Apr 24, 2016 at 15:59 +0200, Niels de Vos wrote:
> > > On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
> > > > On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  
> > > > wrote:
> > > > > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever 
> > > > >  wrote:
> > > > >> Hi all,
> > > > >>
> > > > >> Noticed our regression machines are reporting back really slowly,
> > > > >> especially CentOS and Smoke.
> > > > >>
> > > > >> I found that most of the slaves are marked offline; could this be
> > > > >> the biggest reason?
> > > > >>
> > > > >>
> > > > >
> > > > > Regression machines are scheduled to be offline if there are no active
> > > > > jobs. I wonder if the slowness is related to LVM or related factors as
> > > > > detailed in a recent thread?
> > > > >
> > > > 
> > > > Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)
> > > > 
> > > > Hi Vijay,
> > > > 
> > > > Honestly, I was not aware that the machines move to the offline
> > > > state by themselves; I only knew that they go idle.
> > > > Thanks for sharing that information. But we still need to reclaim
> > > > most of the machines. Here are the reasons why each of them is
> > > > offline.
> > > 
> > > Well, slaves do go offline, and should be woken up when needed.
> > > However it seems that Jenkins fails to connect to many slaves :-/
> > > 
> > > I've rebooted:
> > > 
> > >  - slave46
> > >  - slave28
> > >  - slave26
> > >  - slave25
> > >  - slave24
> > >  - slave23
> > >  - slave21
> > > 
> > > These all seem to have come up correctly after clicking the 'Launch
> > > slave agent' button on the slave's status page.
> > > 
> > > Remember that anyone with a Jenkins account can reboot VMs. This most
> > > often is sufficient to get them working again. Just go to
> > > https://build.gluster.org/job/reboot-vm/ , login and press some buttons.
> > > 
> > > One slave is in a weird status, maybe one of the tests overwrote the ssh
> > > key?
> > > 
> > > [04/24/16 06:48:02] [SSH] Opening SSH connection to 
> > > slave29.cloud.gluster.org:22.
> > > ERROR: Failed to authenticate as jenkins. Wrong password. 
> > > (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
> > > [04/24/16 06:48:04] [SSH] Authentication failed.
> > > hudson.AbortException: Authentication failed.
> > >   at 
> > > hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
> > >   at 
> > > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
> > >   at 
> > > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
> > >   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > >   at 
> > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > >   at 
> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > >   at java.lang.Thread.run(Thread.java:745)
> > > [04/24/16 06:48:04] Launch failed - cleaning up connection
> > > [04/24/16 06:48:05] [SSH] Connection closed.
> > > 
> > > Leaving slave29 as is, maybe one of our admins can have a look and see
> > > if it needs reprovisioning.
> > 
> > It seems slave29 was reinstalled and/or slightly damaged; it was no
> > longer in the salt configuration, but I could connect as root.
> > 
> > It should work better now, but please tell me if anything is incorrect
> > with it.
> 
> Hmm, not really. Launching the Jenkins slave agent on it through the
> web UI still fails in the same way:
> 
>   https://build.gluster.org/computer/slave29.cloud.gluster.org/log
> 
> Maybe the "jenkins" user on the slave has the wrong password?

So, it seems it first had the wrong host key, so I changed that.

I am looking at what is wrong, so do not put it offline :)

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-25 Thread Niels de Vos
On Mon, Apr 25, 2016 at 10:43:13AM +0200, Michael Scherer wrote:
> On Sun, Apr 24, 2016 at 15:59 +0200, Niels de Vos wrote:
> > On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
> > > On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  wrote:
> > > > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever  
> > > > wrote:
> > > >> Hi all,
> > > >>
> > > >> Noticed our regression machines are reporting back really slowly,
> > > >> especially CentOS and Smoke.
> > > >>
> > > >> I found that most of the slaves are marked offline; could this be
> > > >> the biggest reason?
> > > >>
> > > >>
> > > >
> > > > Regression machines are scheduled to be offline if there are no active
> > > > jobs. I wonder if the slowness is related to LVM or related factors as
> > > > detailed in a recent thread?
> > > >
> > > 
> > > Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)
> > > 
> > > Hi Vijay,
> > > 
> > > Honestly, I was not aware that the machines move to the offline
> > > state by themselves; I only knew that they go idle.
> > > Thanks for sharing that information. But we still need to reclaim
> > > most of the machines. Here are the reasons why each of them is
> > > offline.
> > 
> > Well, slaves do go offline, and should be woken up when needed.
> > However it seems that Jenkins fails to connect to many slaves :-/
> > 
> > I've rebooted:
> > 
> >  - slave46
> >  - slave28
> >  - slave26
> >  - slave25
> >  - slave24
> >  - slave23
> >  - slave21
> > 
> > These all seem to have come up correctly after clicking the 'Launch
> > slave agent' button on the slave's status page.
> > 
> > Remember that anyone with a Jenkins account can reboot VMs. This most
> > often is sufficient to get them working again. Just go to
> > https://build.gluster.org/job/reboot-vm/ , login and press some buttons.
> > 
> > One slave is in a weird status, maybe one of the tests overwrote the ssh
> > key?
> > 
> > [04/24/16 06:48:02] [SSH] Opening SSH connection to 
> > slave29.cloud.gluster.org:22.
> > ERROR: Failed to authenticate as jenkins. Wrong password. 
> > (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
> > [04/24/16 06:48:04] [SSH] Authentication failed.
> > hudson.AbortException: Authentication failed.
> > at 
> > hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
> > at 
> > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
> > at 
> > hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > at 
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at 
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> > [04/24/16 06:48:04] Launch failed - cleaning up connection
> > [04/24/16 06:48:05] [SSH] Connection closed.
> > 
> > Leaving slave29 as is, maybe one of our admins can have a look and see
> > if it needs reprovisioning.
> 
> It seems slave29 was reinstalled and/or slightly damaged; it was no
> longer in the salt configuration, but I could connect as root.
> 
> It should work better now, but please tell me if anything is incorrect
> with it.

Hmm, not really. Launching the Jenkins slave agent on it through the
web UI still fails in the same way:

  https://build.gluster.org/computer/slave29.cloud.gluster.org/log

Maybe the "jenkins" user on the slave has the wrong password?

Thanks,
Niels


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-25 Thread Michael Scherer
On Sun, Apr 24, 2016 at 15:59 +0200, Niels de Vos wrote:
> On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
> > On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  wrote:
> > > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever  
> > > wrote:
> > >> Hi all,
> > >>
> > >> Noticed our regression machines are reporting back really slowly,
> > >> especially CentOS and Smoke.
> > >>
> > >> I found that most of the slaves are marked offline; could this be
> > >> the biggest reason?
> > >>
> > >>
> > >
> > > Regression machines are scheduled to be offline if there are no active
> > > jobs. I wonder if the slowness is related to LVM or related factors as
> > > detailed in a recent thread?
> > >
> > 
> > Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)
> > 
> > Hi Vijay,
> > 
> > Honestly, I was not aware that the machines move to the offline
> > state by themselves; I only knew that they go idle.
> > Thanks for sharing that information. But we still need to reclaim
> > most of the machines. Here are the reasons why each of them is
> > offline.
> 
> Well, slaves do go offline, and should be woken up when needed.
> However it seems that Jenkins fails to connect to many slaves :-/
> 
> I've rebooted:
> 
>  - slave46
>  - slave28
>  - slave26
>  - slave25
>  - slave24
>  - slave23
>  - slave21
> 
> These all seem to have come up correctly after clicking the 'Launch
> slave agent' button on the slave's status page.
> 
> Remember that anyone with a Jenkins account can reboot VMs. This most
> often is sufficient to get them working again. Just go to
> https://build.gluster.org/job/reboot-vm/ , login and press some buttons.
> 
> One slave is in a weird status, maybe one of the tests overwrote the ssh
> key?
> 
> [04/24/16 06:48:02] [SSH] Opening SSH connection to 
> slave29.cloud.gluster.org:22.
> ERROR: Failed to authenticate as jenkins. Wrong password. 
> (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
> [04/24/16 06:48:04] [SSH] Authentication failed.
> hudson.AbortException: Authentication failed.
>   at 
> hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
>   at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
>   at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> [04/24/16 06:48:04] Launch failed - cleaning up connection
> [04/24/16 06:48:05] [SSH] Connection closed.
> 
> Leaving slave29 as is, maybe one of our admins can have a look and see
> if it needs reprovisioning.

It seems slave29 was reinstalled and/or slightly damaged; it was no
longer in the salt configuration, but I could connect as root.

It should work better now, but please tell me if anything is incorrect
with it.
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-24 Thread Prasanna Kalever
On Sun, Apr 24, 2016 at 7:29 PM, Niels de Vos  wrote:
> On Sun, Apr 24, 2016 at 04:22:55PM +0530, Prasanna Kalever wrote:
>> On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  wrote:
>> > On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever  
>> > wrote:
>> >> Hi all,
>> >>
>> >> Noticed our regression machines are reporting back really slowly,
>> >> especially CentOS and Smoke.
>> >>
>> >> I found that most of the slaves are marked offline; could this be
>> >> the biggest reason?
>> >>
>> >>
>> >
>> > Regression machines are scheduled to be offline if there are no active
>> > jobs. I wonder if the slowness is related to LVM or related factors as
>> > detailed in a recent thread?
>> >
>>
>> Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)
>>
>> Hi Vijay,
>>
>> Honestly, I was not aware that the machines move to the offline
>> state by themselves; I only knew that they go idle.
>> Thanks for sharing that information. But we still need to reclaim
>> most of the machines. Here are the reasons why each of them is
>> offline.
>
> Well, slaves do go offline, and should be woken up when needed.
> However it seems that Jenkins fails to connect to many slaves :-/
>
> I've rebooted:
>
>  - slave46
>  - slave28
>  - slave26
>  - slave25
>  - slave24
>  - slave23
>  - slave21
>
> These all seem to have come up correctly after clicking the 'Launch
> slave agent' button on the slave's status page.
>
> Remember that anyone with a Jenkins account can reboot VMs. This most
> often is sufficient to get them working again. Just go to
> https://build.gluster.org/job/reboot-vm/ , login and press some buttons.
>
> One slave is in a weird status, maybe one of the tests overwrote the ssh
> key?
>
> [04/24/16 06:48:02] [SSH] Opening SSH connection to 
> slave29.cloud.gluster.org:22.
> ERROR: Failed to authenticate as jenkins. Wrong password. 
> (credentialId:c31bff89-36c0-4f41-aed8-7c87ba53621e/method:password)
> [04/24/16 06:48:04] [SSH] Authentication failed.
> hudson.AbortException: Authentication failed.
> at 
> hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1217)
> at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
> at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> [04/24/16 06:48:04] Launch failed - cleaning up connection
> [04/24/16 06:48:05] [SSH] Connection closed.
>
> Leaving slave29 as is, maybe one of our admins can have a look and see
> if it needs reprovisioning.

That's really cool, Niels, thank you!

It would be helpful if somebody with Jenkins login permissions could
reboot the NetBSD slave nbslave72.cloud.gluster.org (one way to script
this is sketched below, after the list).

The NetBSD slaves mentioned below were marked offline intentionally;
I am listing them here in case we forget to restore them to the online
state. (Please ignore this if they are still needed for other jobs or
have some issues.)


Kaushal :
nbslave74.cloud.gluster.org on Mar 21, 2016 10:59:43 PM
nbslave7h.cloud.gluster.org on Apr 13, 2016 3:15:06 AM


Raghavendra Talur:
nbslave7g.cloud.gluster.org on Mar 29, 2016 2:27:20 AM


Jeff Darcy:
nbslave7i.cloud.gluster.org on Feb 27, 2016 9:09:09 PM
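
In case it helps, both steps (listing offline slaves with their
recorded reasons, and kicking the reboot-vm job) can be scripted. A
sketch using the python-jenkins package; the credentials and the
reboot-vm job's parameter name ("VM" here) are guesses, so please
check the job configuration before relying on this.

    # offline_nodes.py - list offline slaves and reboot one (sketch)
    import jenkins  # pip install python-jenkins

    server = jenkins.Jenkins("https://build.gluster.org",
                             username="someuser", password="api-token")

    # print every offline node together with its recorded reason
    for node in server.get_nodes():
        if node["offline"] and node["name"] != "master":
            info = server.get_node_info(node["name"])
            print("%s - %s" % (node["name"],
                               info.get("offlineCauseReason", "")))

    # trigger the reboot-vm job for one slave
    server.build_job("reboot-vm", {"VM": "nbslave72.cloud.gluster.org"})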


Thanks,
--
Prasanna

>
> Cheers,
> Niels
>
>>
>>
>> CentOS slaves: only 2/14 slaves are online [1]
>>
>> slave20.cloud.gluster.org (online)
>> slave21.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave22.cloud.gluster.org (online)
>> slave23.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave24.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave25.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave26.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
>> rastar taking this down for pranith. Needed for debugging with tar
>> issue.  Apr 20, 2016 3:44:14 AM]
>> slave28.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>> slave29.cloud.gluster.org [Offline Reason: This node is offline
>> because Jenkins failed to launch the slave agent on it.]
>>
>> slave32.cloud.gluster.org [Offline Reason: idle]
>> slave33.cloud.gluster.org [Offline Reason: idle]
>> slave34.cloud.gluster.org [Offline Reason: idle]
>>
>> slave46.cloud.gluster.org [Offline 

Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-24 Thread Prasanna Kalever
On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  wrote:
> On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever  wrote:
>> Hi all,
>>
>> Noticed our regression machines are reporting back really slowly,
>> especially CentOS and Smoke.
>>
>> I found that most of the slaves are marked offline; could this be the
>> biggest reason?
>>
>>
>
> Regression machines are scheduled to be offline if there are no active
> jobs. I wonder if the slowness is related to LVM or related factors as
> detailed in a recent thread?
>

Sorry, the previous mail was sent incomplete (blame some Gmail shortcut)

Hi Vijay,

Honestly, I was not aware that the machines move to the offline state
by themselves; I only knew that they go idle.
Thanks for sharing that information. But we still need to reclaim most
of the machines. Here are the reasons why each of them is offline.


CentOS slaves: only 2/14 slaves are online [1]

slave20.cloud.gluster.org (online)
slave21.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave22.cloud.gluster.org (online)
slave23.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave24.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave25.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave26.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
rastar taking this down for pranith. Needed for debugging with tar
issue.  Apr 20, 2016 3:44:14 AM]
slave28.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave29.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]

slave32.cloud.gluster.org [Offline Reason: idle]
slave33.cloud.gluster.org [Offline Reason: idle]
slave34.cloud.gluster.org [Offline Reason: idle]

slave46.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]




Smoke slaves: only 2/15 slaves are online [2]

slave20.cloud.gluster.org (online)
slave21.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave22.cloud.gluster.org (online)
slave23.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave24.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave25.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave26.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
rastar taking this down for pranith. Needed for debugging with tar
issue.Apr 20, 2016 3:44:14 AM]
slave28.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave29.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]

slave32.cloud.gluster.org [Offline Reason: idle]
slave33.cloud.gluster.org [Offline Reason: idle]
slave34.cloud.gluster.org [Offline Reason: idle]

slave46.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave47.cloud.gluster.org [Offline Reason: idle]




NetBSD slaves: only 6/11 are online [3]

nbslave71.cloud.gluster.org (online)
nbslave72.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
nbslave74.cloud.gluster.org [Offline Reason: Disconnected by kaushal
Mar 21, 2016 10:59:43 PM]
nbslave75.cloud.gluster.org (online)
nbslave77.cloud.gluster.org (online)
nbslave79.cloud.gluster.org (online)

nbslave7c.cloud.gluster.org (online)
nbslave7g.cloud.gluster.org [Offline Reason: Disconnected by rastar :
anoop is using this to debug netbsd related issue Mar 29, 2016 2:27:20
AM]
nbslave7h.cloud.gluster.org [Offline Reason: Disconnected by kaushal
Apr 13, 2016 3:15:06 AM]
nbslave7i.cloud.gluster.org [Offline Reason: Disconnected by jdarcy :
Consistently generating spurious failures due to ping timeouts. This
costs people *hours* for a platform nobody uses except as a test for
perfused. Feb 27, 2016 9:09:09 PM]
nbslave7j.cloud.gluster.org (online)


Summary:

For CentOS regressions: 9/14 slaves were completely down [not just idle]
For Smoke: 9/15 slaves were completely down
For NetBSD regressions: 5/11 slaves were completely down.

IIRC, for CentOS regression and Smoke jobs we 

Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-24 Thread Prasanna Kalever
On Sun, Apr 24, 2016 at 7:11 AM, Vijay Bellur  wrote:
> On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever  wrote:
>> Hi all,
>>
>> Noticed our regression machines are reporting back really slow,
>> especially CentOs and Smoke
>>
>> I found that most of the slaves are marked offline, this could be the
>> biggest reasons ?
>>
>>
>
> Regression machines are scheduled to be offline if there are no active
> jobs. I wonder if the slowness is related to LVM or related factors as
> detailed in a recent thread?

Hi Vijay,

Honestly, I was not aware that the machines move to the offline state
by themselves; I knew only that they go idle. Thanks for sharing this
information. But we still need to reclaim most of the machines:


CentOS slaves: only 2/14 slaves are online [1]

slave20.cloud.gluster.org (online)
slave21.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave22.cloud.gluster.org (online)
slave23.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave24.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave25.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave26.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
rastar taking this down for pranith. Needed for debugging with tar
issue.  Apr 20, 2016 3:44:14 AM]
slave28.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave29.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]

slave32.cloud.gluster.org [Offline Reason: idle]
slave33.cloud.gluster.org [Offline Reason: idle]
slave34.cloud.gluster.org [Offline Reason: idle]

slave46.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]




Smoke slaves: only 2/15 slaves are online [2]

slave20.cloud.gluster.org (online)
slave21.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave22.cloud.gluster.org (online)
slave23.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave24.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave25.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave26.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave27.cloud.gluster.org [Offline Reason: Disconnected by rastar :
rastar taking this down for pranith. Needed for debugging with tar
issue.Apr 20, 2016 3:44:14 AM]
slave28.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave29.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]

slave32.cloud.gluster.org [Offline Reason: idle]
slave33.cloud.gluster.org [Offline Reason: idle]
slave34.cloud.gluster.org [Offline Reason: idle]

slave46.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
slave47.cloud.gluster.org [Offline Reason: idle]



NetBSD slaves: only 6/11 are online [3]

nbslave71.cloud.gluster.org (online)
nbslave72.cloud.gluster.org [Offline Reason: This node is offline
because Jenkins failed to launch the slave agent on it.]
nbslave74.cloud.gluster.org [Offline Reason: Disconnected by kaushal
Mar 21, 2016 10:59:43 PM]
nbslave75.cloud.gluster.org (online)
nbslave77.cloud.gluster.org (online)
nbslave79.cloud.gluster.org (online)

nbslave7c.cloud.gluster.org (online)
nbslave7g.cloud.gluster.org [Offline Reason: Disconnected by rastar :
anoop is using this to debug netbsd related issue Mar 29, 2016 2:27:20
AM]
nbslave7h.cloud.gluster.org [Offline Reason: Disconnected by kaushal
Apr 13, 2016 3:15:06 AM]
nbslave7i.cloud.gluster.org [Offline Reason: Disconnected by jdarcy :
Consistently generating spurious failures due to ping timeouts. This
costs people *hours* for a platform nobody uses except as a test for
perfused.
Feb 27, 2016 9:09:09 PM]
nbslave7j.cloud.gluster.org (online)

Summary:

For CentOS regressions: 9/14 slaves were completely down [not just idle]
For Smoke: 9/15 slaves were completely down
For NetBSD regressions: we can reclaim 5/11 slaves

[1] https://build.gluster.org/label/rackspace_regression_2gb/
[2] https://build.gluster.org/label/smoke_tests/
[3] https://build.gluster.org/label/netbsd7_regression/





Re: [Gluster-devel] [Gluster-infra] regression machines reporting slowly ? here is the reason ...

2016-04-23 Thread Vijay Bellur
On Sat, Apr 23, 2016 at 9:30 AM, Prasanna Kalever  wrote:
> Hi all,
>
> Noticed our regression machines are reporting back really slowly,
> especially CentOS and Smoke.
>
> I found that most of the slaves are marked offline; could this be the
> biggest reason?
>
>

Regression machines are scheduled to be offline if there are no active
jobs. I wonder if the slowness is related to LVM or related factors as
detailed in a recent thread?

-Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel