Re: [Gluster-infra] [Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-19 Thread Nigel Babu
On Mon, Feb 19, 2018 at 5:58 PM, Nithya Balachandran 
wrote:

>
>
> On 19 February 2018 at 13:12, Atin Mukherjee  wrote:
>
>>
>>
>> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu  wrote:
>>
>>> Hello,
>>>
>>> As you all most likely know, we store the tarball of the binaries and
>>> core if there's a core during regression. Occasionally, we've introduced a
>>> bug in Gluster and this tar can take up a lot of space. This has happened
>>> recently with brick multiplex tests. The build-install tar takes up 25G,
>>> causing the machine to run out of space and continuously fail.
>>>
>>
>> AFAIK, we don't have a .t file in the upstream regression suites where
>> hundreds of volumes are created. At that scale, with brick multiplexing
>> enabled, I can understand the core being quite heavily loaded and
>> consuming such a crazy amount of space. FWIW, can we first try to figure
>> out which test caused this crash, and see whether running a gcore after
>> certain steps in the tests leaves us with a core file of a similar size?
>> IOW, have we actually seen core files this huge generated before? If
>> not, what changed to make us start seeing them is something to be
>> investigated.
>>
>
> We also need to check if this is only the core file that is causing the
> increase in size or whether there is something else that is taking up a lot
> of space.
>
>
I don't disagree. However, there are two problems here. In the few cases
where we've had such a large build-install tarball:

1. The tar doesn't actually finish being created, so it's not even
something that can be untar'd; trying to extract it would just error out.
2. All subsequent jobs on this node fail.

The only remaining option is to watch for situations where the tar file
doesn't finish being created and flag them. With chunked regressions, the
nodes do not get re-used, so 2 isn't a problem.
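
As a rough sketch of what that check could look like (the paths and archive
layout here are assumptions, not the exact job configuration):

    #!/bin/bash
    # Hypothetical post-regression step: verify the build-install tarball is
    # complete, and prune oversized leftovers so the node doesn't fill up
    # and fail later jobs.
    TARBALL=/archives/archived_builds/build-install.tar.bz2   # assumed path

    if ! tar -tf "$TARBALL" > /dev/null 2>&1; then
        echo "WARNING: $TARBALL is truncated or unreadable" >&2
    fi

    # delete anything larger than 1G, including the tarball itself
    find /archives -type f -size +1G -delete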

-- 
nigelb

[Gluster-infra] Infra machines update

2018-02-19 Thread Nigel Babu
Hello folks,

We're all out of CentOS 6 nodes as of today; I've just deleted the last of
them. We now run exclusively on CentOS 7 nodes.

We've not received any negative feedback about the plan to move away from
NetBSD, so I've disabled and removed all the NetBSD jobs and nodes as well.

-- 
nigelb

[Gluster-infra] [Bug 1546040] Need centos machine to validate all test cases while brick mux is on

2018-02-19 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1546040

Mohit Agrawal  changed:

   What|Removed |Added

  Flags||needinfo?(nig...@redhat.com
   ||)





[Gluster-infra] [Bug 1546040] Need centos machine to validate all test cases while brick mux is on

2018-02-19 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1546040

Nigel Babu  changed:

   What|Removed |Added

  Flags|needinfo?(nig...@redhat.com |
   |)   |



--- Comment #3 from Nigel Babu  ---
Waiting for the machine to build. It's going to take about an hour or so.



[Gluster-infra] [Bug 1546645] New: Ansible should setup correct selinux context for /archives

2018-02-19 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1546645

Bug ID: 1546645
   Summary: Ansible should setup correct selinux context for
/archives
   Product: GlusterFS
   Version: 4.0
 Component: project-infrastructure
  Assignee: b...@gluster.org
  Reporter: nig...@redhat.com
CC: b...@gluster.org, gluster-infra@gluster.org



The default node setup does not allow the files in /archives to actually be
visible via Nginx. We need to make the SELinux setup part of the Ansible
role.
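
As a rough sketch (assuming the standard httpd content type is what Nginx
needs in order to read the files), the role would have to do the equivalent
of:

    # label /archives so the nginx process (httpd_t) is allowed to read it
    semanage fcontext -a -t httpd_sys_content_t "/archives(/.*)?"
    restorecon -Rv /archives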



Re: [Gluster-infra] [Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-19 Thread Sankarshan Mukhopadhyay
On Mon, Feb 19, 2018 at 5:58 PM, Nithya Balachandran
 wrote:
>
>
> On 19 February 2018 at 13:12, Atin Mukherjee  wrote:
>>
>>
>>
>> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu  wrote:
>>>
>>> Hello,
>>>
>>> As you all most likely know, we store the tarball of the binaries and
>>> core if there's a core during regression. Occasionally, we've introduced a
>>> bug in Gluster and this tar can take up a lot of space. This has happened
>>> recently with brick multiplex tests. The build-install tar takes up 25G,
>>> causing the machine to run out of space and continuously fail.
>>
>>
>> AFAIK, we don't have a .t file in the upstream regression suites where
>> hundreds of volumes are created. At that scale, with brick multiplexing
>> enabled, I can understand the core being quite heavily loaded and
>> consuming such a crazy amount of space. FWIW, can we first try to figure
>> out which test caused this crash, and see whether running a gcore after
>> certain steps in the tests leaves us with a core file of a similar size?
>> IOW, have we actually seen core files this huge generated before? If
>> not, what changed to make us start seeing them is something to be
>> investigated.
>
>
> We also need to check if this is only the core file that is causing the
> increase in size or whether there is something else that is taking up a lot
> of space.
>>
>>
>>>
>>>
>>> I've made some changes this morning. Right after we create the tarball,
>>> we'll delete all files in /archive that are greater than 1G. Please be aware
>>> that this means all large files including the newly created tarball will be
>>> deleted. You will have to work with the traceback on the Jenkins job.
>>
>>
>> We'd really need to first investigate the average core file size we can
>> expect when a system is running with brick multiplexing and ongoing I/O.
>> Without that, immediately deleting core files > 1G will cause trouble
>> for the developers debugging genuine crashes, as the traceback alone may
>> not be sufficient.
>>

I'd like to echo what Nithya writes: instead of treating this
incident as an outlier, we might want to do further analysis. If this
had happened on a production system, there would be blood.


Re: [Gluster-infra] [Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-19 Thread Nithya Balachandran
On 19 February 2018 at 13:12, Atin Mukherjee  wrote:

>
>
> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu  wrote:
>
>> Hello,
>>
>> As you all most likely know, we store the tarball of the binaries and
>> core if there's a core during regression. Occasionally, we've introduced a
>> bug in Gluster and this tar can take up a lot of space. This has happened
>> recently with brick multiplex tests. The build-install tar takes up 25G,
>> causing the machine to run out of space and continuously fail.
>>
>
> AFAIK, we don't have a .t file in the upstream regression suites where
> hundreds of volumes are created. At that scale, with brick multiplexing
> enabled, I can understand the core being quite heavily loaded and
> consuming such a crazy amount of space. FWIW, can we first try to figure
> out which test caused this crash, and see whether running a gcore after
> certain steps in the tests leaves us with a core file of a similar size?
> IOW, have we actually seen core files this huge generated before? If
> not, what changed to make us start seeing them is something to be
> investigated.
>

We also need to check if this is only the core file that is causing the
increase in size or whether there is something else that is taking up a lot
of space.
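
Something along these lines on an affected node would tell us (assuming the
regression job keeps everything under /archives):

    # show the largest items and any file over 1G in the archive area
    du -sh /archives/* 2>/dev/null | sort -rh | head
    find /archives -type f -size +1G -exec ls -lh {} +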

>
>
>>
>> I've made some changes this morning. Right after we create the tarball,
>> we'll delete all files in /archive that are greater than 1G. Please be
>> aware that this means all large files including the newly created tarball
>> will be deleted. You will have to work with the traceback on the Jenkins
>> job.
>>
>
> We'd really need to first investigate the average core file size we can
> expect when a system is running with brick multiplexing and ongoing I/O.
> Without that, immediately deleting core files > 1G will cause trouble
> for the developers debugging genuine crashes, as the traceback alone may
> not be sufficient.
>
>
>>
>>
>>
>> --
>> nigelb
>>

Re: [Gluster-infra] [Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-19 Thread Mohit Agrawal
Hi,

  I think I know the reason the tarball size is bigger: it can happen when
the tar file contains more than one core.
  I triggered a build (https://review.gluster.org/19574, to validate all test
cases with brick mux enabled) after setting exit_one_failure="no" in
run-tests.sh, so the build executed all test cases, and with the earlier
version of the patch I was getting multiple cores.

  Now it is generating only one core; it seems the other code paths are
fixed, so the issue should be resolved now.
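
  (For reference, the run-tests.sh change amounts to something like the line
below; this is a sketch using the flag name as I wrote it above, not
necessarily the exact upstream default:)

    # run-tests.sh: keep running the remaining .t files after a failure
    exit_one_failure="no"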


Regards
Mohit Agrawal

On Mon, Feb 19, 2018 at 6:07 PM, Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Mon, Feb 19, 2018 at 5:58 PM, Nithya Balachandran
>  wrote:
> >
> >
> > On 19 February 2018 at 13:12, Atin Mukherjee 
> wrote:
> >>
> >>
> >>
> >> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu  wrote:
> >>>
> >>> Hello,
> >>>
> >>> As you all most likely know, we store the tarball of the binaries and
> >>> core if there's a core during regression. Occasionally, we've
> >>> introduced a bug in Gluster and this tar can take up a lot of space.
> >>> This has happened recently with brick multiplex tests. The
> >>> build-install tar takes up 25G, causing the machine to run out of
> >>> space and continuously fail.
> >>
> >>
> >> AFAIK, we don't have a .t file in the upstream regression suites where
> >> hundreds of volumes are created. At that scale, with brick multiplexing
> >> enabled, I can understand the core being quite heavily loaded and
> >> consuming such a crazy amount of space. FWIW, can we first try to figure
> >> out which test caused this crash, and see whether running a gcore after
> >> certain steps in the tests leaves us with a core file of a similar size?
> >> IOW, have we actually seen core files this huge generated before? If
> >> not, what changed to make us start seeing them is something to be
> >> investigated.
> >
> >
> > We also need to check if this is only the core file that is causing the
> > increase in size or whether there is something else that is taking up a
> > lot of space.
> >>
> >>
> >>>
> >>>
> >>> I've made some changes this morning. Right after we create the
> >>> tarball, we'll delete all files in /archive that are greater than 1G.
> >>> Please be aware that this means all large files including the newly
> >>> created tarball will be deleted. You will have to work with the
> >>> traceback on the Jenkins job.
> >>
> >>
> >> We'd really need to first investigate the average core file size we can
> >> expect when a system is running with brick multiplexing and ongoing
> >> I/O. Without that, immediately deleting core files > 1G will cause
> >> trouble for the developers debugging genuine crashes, as the traceback
> >> alone may not be sufficient.
> >>
>
> I'd like to echo what Nithya writes: instead of treating this
> incident as an outlier, we might want to do further analysis. If this
> had happened on a production system, there would be blood.
>