Re: [Gluster-infra] Looks like glusterfs's smoke job is not running for the patches posted

2018-08-15 Thread Kotresh Hiremath Ravishankar
ok, didn't know that.

On Thu, Aug 16, 2018 at 8:06 AM, Nigel Babu  wrote:

> This is something I've highlighted in the past. If you trigger regression
> and smoke at the same time, smoke will only vote after the regression job
> is done. That's Jenkins optimizing its communication with Gerrit so that
> it only needs to vote once. This is a feature and not a bug.
>
> On Wed, Aug 15, 2018 at 11:32 PM Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>> The job was triggered for me, but it has not flagged +1.
>>
>> Reference: https://review.gluster.org/#/c/glusterfs/+/20548/
>>
>
>
> --
> nigelb
>



-- 
Thanks and Regards,
Kotresh H R
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Postmortem for gluster jenkins disk full outage on the 15th of August

2018-08-15 Thread Nigel Babu
On Wed, Aug 15, 2018 at 2:41 PM Michael Scherer  wrote:

> Hi folks,
>
> So the Gluster Jenkins disk was full today (because outages do not respect
> public holidays in India (Independence Day) or France (Assumption));
> here is the postmortem for your reading pleasure.
>
> Date: 15/08/2018
>
> Service affected:
>   Jenkins for Gluster (jenkins-el7.rht.gluster.org)
>
> Impact:
>
>   No Jenkins job could be triggered.
>
> Root cause:
>
>   The disk filled up, mainly because we have new jobs and more patches,
> so regular growth.
>
> Resolution:
>
>   Increased the disk by 30G, and we are investigating whether cleanup
>   could be improved. This required a reboot.
>
>
> Involved people:
> - misc
> - nigel
>
> Lessons learned
> - What went well:
>   - we had a documented process for this, good enough to be used by
> a tired admin.
>
> - What went bad:
>   - we weren't proactive enough to catch this before it caused an outage
>   - the 15th of August is a holiday in both France and India; technically,
> none of the infra team should have been working.
>
> - Where we were lucky:
>   - it was a day off in India, so few people were affected, except
> folks who keep working on days off
>   - Misc decided to work while in Brno and to take days off later
>
>
> Timeline (in UTC)
>
> - 05:58 Amar posts a mail saying "smoke job fail" on gluster-infra:
> https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html
>
> - 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
> away from his laptop for the Independence Day celebration.
>
> - 06:24 Misc does not hear the ding since he is asleep.
>
> - 06:55 Sankarshan opens a bug on it:
> https://bugzilla.redhat.com/show_bug.cgi?id=1616160
>
> - 06:56 Misc does not see the email since he is still asleep.
>
> - 07:13 Misc wakes up, sees a blinking light on the phone and ponders
> closing his eyes again. He looks at it and starts to swear.
>
> - 07:14 Investigation reveals that the Jenkins partition is full (100%).
> A quick investigation does not find any particular issue; the Jenkins
> jobs are simply taking up space.
>
> - 07:19 After discussion with Nigel, it is decided to increase the size
> of the partition. Misc takes a look and tries to increase it, without
> any luck. The server is rebooted in case that is what was needed. Still
> not enough.
>
> - 07:25 Misc takes a quick shower to wake himself up. The warm embrace
> of water makes him remember that documentation for that process does
> exist:
>
> https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html
>
> - 07:30 Following the documentation, we discover that the hypervisor
> is now out of space for future increases. Looking into that will be
> done after the postmortem.
>
> - 07:37 Jenkins is restarted with more space and seems to work OK.
>
> - 07:38 Misc rushes to his hotel breakfast, which closes at 10.
>
> - 09:09 Postmortem is finished and sent.
>
>
> Action items:
> - (misc) see what can be done for myrmicinae (the hypervisor where
> Jenkins is running) since there is no more space.
>
> Potential improvements to make:
> - we still need to have monitoring in place
> - we need to move Munin into the internal LAN so we can look at the
> graphs for Jenkins
> - documentation regarding resizing could be clearer, notably the volume
> resizing part
>

This highlights that we need to solve
https://bugzilla.redhat.com/show_bug.cgi?id=1564372 as a priority. The lack
of monitoring is affecting day-to-day work.

-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Looks like glusterfs's smoke job is not running for the patches posted

2018-08-15 Thread Nigel Babu
This is something I've highlighted in the past. If you trigger regression
and smoke at the same time, smoke will only vote after the regression job
is done. That's Jenkins optimizing its communication with Gerrit so that
it only needs to vote once. This is a feature and not a bug.
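
As a rough illustration of the behaviour described above (and not something
from the original thread), the Python sketch below polls the Gerrit REST API
for the Verified label on a change, which shows the vote state even before
the aggregated Jenkins comment lands. It assumes anonymous read access to
review.gluster.org and uses the change number referenced in this thread.

    # Illustrative sketch only, not part of the original mail or the
    # gluster-infra tooling. Assumes anonymous read access to the Gerrit
    # REST API on review.gluster.org.
    import json
    import urllib.request

    CHANGE_ID = "20548"  # the change referenced in this thread
    url = "https://review.gluster.org/changes/{}/detail".format(CHANGE_ID)

    with urllib.request.urlopen(url) as resp:
        raw = resp.read().decode("utf-8")

    # Gerrit prefixes JSON responses with ")]}'" to prevent XSSI; drop that line.
    data = json.loads(raw.split("\n", 1)[1])

    verified = data.get("labels", {}).get("Verified", {})
    votes = [v.get("value", 0) for v in verified.get("all", [])]
    print("Verified votes:", votes or "none yet (jobs may still be running)")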

On Wed, Aug 15, 2018 at 11:32 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> The job was triggered for me, but it has not flagged +1.
>
> Reference: https://review.gluster.org/#/c/glusterfs/+/20548/
>


-- 
nigelb
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Looks like glusterfs's smoke job is not running for the patches posted

2018-08-15 Thread Kotresh Hiremath Ravishankar
The job was triggered for me, but it has not flagged +1.

Reference: https://review.gluster.org/#/c/glusterfs/+/20548/

On Wed, Aug 15, 2018 at 12:26 PM, Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Wed, Aug 15, 2018 at 11:28 AM Amar Tumballi 
> wrote:
> >
> > Not sure why the job didn't get triggered even when I did 'recheck
> > smoke'.
> >
>
> Now at 
>
> --
> sankarshan mukhopadhyay
> 
> ___
> Gluster-infra mailing list
> Gluster-infra@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
Thanks and Regards,
Kotresh H R
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [shadow-it] Postmortem for gluster jenkins disk full outage on the 15th of August

2018-08-15 Thread Michael Scherer
On Wednesday, 15 August 2018 at 11:10 +0200, Michael Scherer wrote:
> Hi folks,
> 
> So the Gluster Jenkins disk was full today (because outages do not respect
> public holidays in India (Independence Day) or France (Assumption));
> here is the postmortem for your reading pleasure.
>
> Date: 15/08/2018
>
> Service affected:
>   Jenkins for Gluster (jenkins-el7.rht.gluster.org)
>
> Impact:
>
>   No Jenkins job could be triggered.
>
> Root cause:
>
>   The disk filled up, mainly because we have new jobs and more patches,
> so regular growth.
>
> Resolution:
>
>   Increased the disk by 30G, and we are investigating whether cleanup
>   could be improved. This required a reboot.
>
> []
> 
> Action items:
> - (misc) see what can be done for myrmicinae (the hypervisor where
> jenkins is running) since there is no more space.

So I looked at myrmicinae, and:
- we have only 23G free for VMs

- there is a 300G partition for the old Jenkins/Gerrit VM that we
migrated last November. I kept it to be able to recover if needed, but
I guess it is no longer needed.

I will sync with Nigel to make extra sure that we can remove this
partition.
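
A quick way to make that kind of check repeatable is sketched below; this is
an illustrative assumption on my part rather than an existing gluster-infra
script. It lists libvirt storage pools and their volumes on a KVM hypervisor
such as myrmicinae, which would show where the leftover 300G image sits (it
assumes the libvirt Python bindings are installed).

    # Illustrative sketch, not an existing gluster-infra script. Assumes a
    # KVM/libvirt hypervisor with the libvirt Python bindings available.
    import libvirt

    GIB = 1024 ** 3

    conn = libvirt.openReadOnly("qemu:///system")
    for pool in conn.listAllStoragePools():
        state, capacity, allocation, available = pool.info()
        print("pool %-15s %6.1fG free of %6.1fG"
              % (pool.name(), available / GIB, capacity / GIB))
        for vol in pool.listAllVolumes():
            vol_type, vol_capacity, vol_allocation = vol.info()
            print("  volume %-35s %6.1fG" % (vol.name(), vol_capacity / GIB))
    conn.close()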

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Postmortem for gluster jenkins disk full outage on the 15th of August

2018-08-15 Thread Michael Scherer
On Wednesday, 15 August 2018 at 14:50 +0530, Sankarshan Mukhopadhyay wrote:
> Thank you for (a) addressing the issue and (b) this write up
> 
> Does the -infra team have a way to monitor disk space usage?

Munin:
http://munin.gluster.org/munin/rht.gluster.org/jenkins-el7.rht.gluster.org/index.html#disk

It seems we did have notifications, but they were turned off (by me) in May
2016 with a laconic "receiving too much of them for now". I guess it was
sending too many false positives and we didn't spend time to fix that.


I have wanted to move it out of Rackspace for a long time, since it can't
monitor the internal network, and also to move to Nagios for alerting
(since you can filter alerts).
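
For illustration, a minimal Nagios-style disk check is sketched below; the
path and thresholds are assumptions for the example, not the actual
gluster-infra configuration. Alerting only above a usage threshold is what
keeps the false-positive noise down.

    #!/usr/bin/env python3
    # Illustrative Nagios-style disk check; the path and thresholds are
    # assumptions for the example, not the real gluster-infra settings.
    import shutil
    import sys

    PATH = "/var/lib/jenkins"   # assumed Jenkins data mount point
    WARN, CRIT = 80, 90         # percent-used thresholds

    total, used, free = shutil.disk_usage(PATH)
    pct = used * 100.0 / total

    if pct >= CRIT:
        print("CRITICAL: %s is %.0f%% full (%.1fG free)" % (PATH, pct, free / 1024 ** 3))
        sys.exit(2)  # Nagios exit code for critical
    if pct >= WARN:
        print("WARNING: %s is %.0f%% full (%.1fG free)" % (PATH, pct, free / 1024 ** 3))
        sys.exit(1)  # Nagios exit code for warning
    print("OK: %s is %.0f%% full" % (PATH, pct))
    sys.exit(0)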



> On Wed, Aug 15, 2018 at 2:40 PM Michael Scherer  wrote:
> >
> > [...]
> 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] Postmortem for gluster jenkins disk full outage on the 15th of August

2018-08-15 Thread Sankarshan Mukhopadhyay
Thank you for (a) addressing the issue and (b) this write-up.

Does the -infra team have a way to monitor disk space usage?

On Wed, Aug 15, 2018 at 2:40 PM Michael Scherer  wrote:
> [...]



-- 
sankarshan mukhopadhyay

___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra


[Gluster-infra] Postmortem for gluster jenkins disk full outage on the 15th of August

2018-08-15 Thread Michael Scherer
Hi folks,

So the Gluster Jenkins disk was full today (because outages do not respect
public holidays in India (Independence Day) or France (Assumption));
here is the postmortem for your reading pleasure.

Date: 15/08/2018

Service affected:
  Jenkins for Gluster (jenkins-el7.rht.gluster.org)

Impact:

  No Jenkins job could be triggered.

Root cause:

  The disk filled up, mainly because we have new jobs and more patches,
so regular growth.

Resolution:

  Increased the disk by 30G, and we are investigating whether cleanup
  could be improved. This required a reboot.


Involved people:
- misc
- nigel

Lessons learned
- What went well:
  - we had a documented process for this, good enough to be used by
a tired admin.

- What went bad:
  - we weren't proactive enough to catch this before it caused an outage
  - the 15th of August is a holiday in both France and India; technically,
none of the infra team should have been working.

- Where we were lucky:
  - it was a day off in India, so few people were affected, except
folks who keep working on days off
  - Misc decided to work while in Brno and to take days off later


Timeline (in UTC)

- 05:58 Amar posts a mail saying "smoke job fail" on gluster-infra:
https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html

- 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
away from his laptop for the Independence Day celebration.

- 06:24 Misc does not hear the ding since he is asleep.

- 06:55 Sankarshan opens a bug on it:
https://bugzilla.redhat.com/show_bug.cgi?id=1616160

- 06:56 Misc does not see the email since he is still asleep.

- 07:13 Misc wakes up, sees a blinking light on the phone and ponders
closing his eyes again. He looks at it and starts to swear.

- 07:14 Investigation reveals that the Jenkins partition is full (100%).
A quick investigation does not find any particular issue; the Jenkins
jobs are simply taking up space.

- 07:19 After discussion with Nigel, it is decided to increase the size
of the partition. Misc takes a look and tries to increase it, without
any luck. The server is rebooted in case that is what was needed. Still
not enough.

- 07:25 Misc takes a quick shower to wake himself up. The warm embrace
of water makes him remember that documentation for that process does
exist:

https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html

- 07:30 Following the documentation, we discover that the hypervisor
is now out of space for future increases. Looking into that will be
done after the postmortem.

- 07:37 Jenkins is restarted with more space and seems to work OK.

- 07:38 Misc rushes to his hotel breakfast, which closes at 10.

- 09:09 Postmortem is finished and sent.


Action items:
- (misc) see what can be done for myrmicinae (the hypervisor where
Jenkins is running) since there is no more space.

Potential improvements to make:
- we still need to have monitoring in place
- we need to move Munin into the internal LAN so we can look at the
graphs for Jenkins
- documentation regarding resizing could be clearer, notably the volume
resizing part


-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS



signature.asc
Description: This is a digitally signed message part
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] [Bug 1616160] Looks like glusterfs's smoke job is not running for the patches posted

2018-08-15 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1616160

M. Scherer  changed:

           What             |Removed              |Added
------------------------------------------------------------------------
 Comment #0 is private      |1                    |0
 Status                     |NEW                  |CLOSED
 CC                         |                     |msche...@redhat.com
 Resolution                 |---                  |CURRENTRELEASE
 Last Closed                |                     |2018-08-15 04:38:11



--- Comment #1 from M. Scherer  ---
Done, post mortem is on its way.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug 
https://bugzilla.redhat.com/token.cgi?t=JNQDT8gmgs=cc_unsubscribe
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra


Re: [Gluster-infra] Looks like glusterfs's smoke job is not running for the patches posted

2018-08-15 Thread Sankarshan Mukhopadhyay
On Wed, Aug 15, 2018 at 11:28 AM Amar Tumballi  wrote:
>
> Not sure why the job didn't get triggered even when I did 'recheck smoke'.
>

Now at 

-- 
sankarshan mukhopadhyay

___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra


[Gluster-infra] [Bug 1616160] New: Looks like glusterfs's smoke job is not running for the patches posted

2018-08-15 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1616160

            Bug ID: 1616160
           Summary: Looks like glusterfs's smoke job is not running for
                    the patches posted
           Product: GlusterFS
           Version: mainline
         Component: project-infrastructure
          Severity: high
          Assignee: b...@gluster.org
          Reporter: sankars...@redhat.com
                CC: b...@gluster.org, gluster-infra@gluster.org



-- 
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug 
https://bugzilla.redhat.com/token.cgi?t=A6phufp6qZ=cc_unsubscribe
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra