Re: [Gluster-infra] [Gluster-devel] NetBSD regressions not being triggered for patches

2015-06-17 Thread Nithya Balachandran
- Original Message -
 From: Avra Sengupta aseng...@redhat.com
 To: Rajesh Joseph rjos...@redhat.com, Kaushal M kshlms...@gmail.com
 Cc: Gluster Devel gluster-de...@gluster.org, gluster-infra 
 gluster-infra@gluster.org
 Sent: Wednesday, June 17, 2015 1:42:25 PM
 Subject: Re: [Gluster-devel] [Gluster-infra] NetBSD regressions not being 
 triggered for patches
 
 On 06/17/2015 12:12 PM, Rajesh Joseph wrote:
 
  - Original Message -
  From: Kaushal M kshlms...@gmail.com
  To: Emmanuel Dreyfus m...@netbsd.org
  Cc: Gluster Devel gluster-de...@gluster.org, gluster-infra
  gluster-infra@gluster.org
  Sent: Wednesday, 17 June, 2015 11:59:22 AM
  Subject: Re: [Gluster-devel] [Gluster-infra] NetBSD regressions not being
  triggered for patches
 
  cloud.gluster.org is served by Rackspace Cloud DNS. AFAICT, there is
  no readily available option to do zone transfers from it. We might
  have to contact Rackspace support to find out whether they can do it as
  a special request.
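A quick way to confirm this from a shell is to ask the zone's authoritative servers for a transfer directly; the nameserver used below is a placeholder for whatever the NS records actually point at:

    # Find the authoritative servers and try a zone transfer (AXFR).
    dig +short NS cloud.gluster.org
    dig @dns1.stabletransit.com cloud.gluster.org AXFR
    # "Transfer failed." or a timeout means AXFR is refused, and a special
    # request to Rackspace support would indeed be needed.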
 
  If this is going to take time, then I prefer not to block patches on
  NetBSD. We can address any NetBSD regression caused by patches as a
  separate bug. Otherwise our regression queue will continue to grow.
 +1 for this. We shouldn't be blocking patches on NetBSD regressions till
 the infra scales enough to handle the kind of load we are throwing at
 it. Once the regression framework is scalable enough, we can fix any
 regressions that are introduced. This will bring down the turnaround
 time for patch acceptance.

+1


 
  On Wed, Jun 17, 2015 at 11:50 AM, Emmanuel Dreyfus m...@netbsd.org
  wrote:
  Venky Shankar yknev.shan...@gmail.com wrote:
 
  If that's the case, then I'll vote for this even if it takes some time
  to get things into a workable state.
  See my other mail about this: you enter a new slave VM in the DNS and it
  does not resolve, or sometimes you get 20s delays. I am convinced this
  is the reason why Jenkins misbehaves.
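A rough way to quantify what Emmanuel describes is to time lookups of a freshly added slave against the resolver Jenkins uses; the hostname below is hypothetical:

    # Time a handful of lookups; multi-second query times or SERVFAILs would
    # match the 20s delays seen from Jenkins.
    for i in $(seq 1 5); do
        dig +tries=1 +time=30 nbslave7x.cloud.gluster.org A | grep 'Query time'
        sleep 1
    done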
 
  --
  Emmanuel Dreyfus
  http://hcpnet.free.fr/pubz
  m...@netbsd.org
  ___
  Gluster-infra mailing list
  Gluster-infra@gluster.org
  http://www.gluster.org/mailman/listinfo/gluster-infra
  ___
  Gluster-devel mailing list
  gluster-de...@gluster.org
  http://www.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-infra


Re: [Gluster-infra] [Gluster-devel] NetBSD tests not running to completion.

2016-01-07 Thread Nithya Balachandran
I agree. 

Regards,
Nithya

- Original Message -
> From: "Atin Mukherjee" 
> To: "Joseph Fernandes" , "Avra Sengupta" 
> 
> Cc: "Gluster Devel" , "gluster-infra" 
> 
> Sent: Thursday, January 7, 2016 1:53:47 PM
> Subject: Re: [Gluster-devel] [Gluster-infra] NetBSD tests not running to 
> completion.
> 
> I have always been in favour of this approach, right from the beginning.
> We can definitely have nightly, if not weekly, NetBSD regressions to
> sanitize the changes. With that model we wouldn't need to eliminate BSD
> support, but we could avoid this hard dependency in patch acceptance,
> which has haunted us *multiple* times now.
> 
> Thanks,
> Atin
> 
> On 01/07/2016 12:38 PM, Joseph Fernandes wrote:
> > +2 Avra
> > 
> > - Original Message -
> > From: "Avra Sengupta" 
> > To: "Gluster Devel" , "gluster-infra"
> > 
> > Sent: Thursday, January 7, 2016 11:51:51 AM
> > Subject: Re: [Gluster-infra] [Gluster-devel] NetBSD tests not running to
> > completion.
> > 
> > The same issue keeps coming up every few months, where all patch
> > acceptance comes to a grinding halt with a dependency on NetBSD
> > regressions. I have been re-triggering my patches too, and they are not
> > completing. Not to mention the long wait queue for them to run in the
> > first place, and then having them not complete.
> > 
> > I know this issue has been discussed many times before, and every time we
> > have arrived at the conclusion that we need more stable tests or a more
> > robust infrastructure, but there's more to it than that.
> > Here are a few of the things I have observed:
> > 1. Not many people are well versed in debugging the issues that cause
> > failures in the NetBSD regression suites, simply because not many
> > of us are familiar with the nuances of the platform.
> > 2. If I am a developer interested in being part of the gluster
> > community and contributing code to it, the patches I send will have a
> > dependency on NetBSD regressions. When people who have been contributing
> > for years find it cumbersome to get the NetBSD regressions to pass
> > for their patches, imagine the impression it leaves and the impact on
> > motivation it will have on a new developer. We need to ask ourselves how
> > this is impacting the patch acceptance process.
> > 
> > We can at least try a few different approaches to tackle this problem
> > instead of just waiting for the test suite to stabilize or the
> > infrastructure to get better.
> > 1. We can have NetBSD as a separate port, and not have patches sent to
> > the master branch depend on its regressions.
> > 2. We can also have a nightly NetBSD regression run, instead of running
> > it per patch. If a particular regression test fails, the owner of the
> > test looks into it, and we debug the issue. One might say this just
> > delays the problem, but at least we will not have all patch
> > acceptances blocked.
> > 3. We really need to trigger regressions only on patches that have
> > been reviewed and have received a +2. This will substantially bring down
> > the wait time. I remember Atin bringing this up a few months back, but
> > it still hasn't been implemented. Can we please have this ASAP?
> > (A sketch of how this could be automated is below.)
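A minimal sketch of the third idea, assuming the Gerrit SSH port on review.gluster.org and the Jenkins job name seen later in this thread, and assuming the job is parameterized with a GERRIT_REFSPEC and a trigger token (both placeholders here); the Gerrit Trigger plugin could express the same rule declaratively:

    # Find open glusterfs changes that already carry Code-Review +2 and
    # trigger the NetBSD regression job only for those.
    ssh -p 29418 yourname@review.gluster.org gerrit query --format=JSON \
        --current-patch-set 'status:open project:glusterfs label:Code-Review=2' |
    while read -r change; do
        ref=$(echo "$change" | jq -r '.currentPatchSet.ref // empty')
        [ -n "$ref" ] || continue
        curl -s "https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/buildWithParameters?token=TOKEN&GERRIT_REFSPEC=$ref"
    done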
> > 
> > Regards,
> > Avra
> > 
> > On 01/06/2016 05:49 PM, Ravishankar N wrote:
> >> I re-triggered the NetBSD regressions for
> >> http://review.gluster.org/#/c/13041/3 but they are being run in silent
> >> mode and are not completing. Can someone from the infra team take a
> >> look? The last 22 tests in
> >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/
> >> have failed. It is highly unlikely that something is wrong with all
> >> those patches.
> >>
> >> Thanks,
> >> Ravi
> >> ___
> >> Gluster-infra mailing list
> >> Gluster-infra@gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-infra
> > 
> > ___
> > Gluster-infra mailing list
> > Gluster-infra@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-infra
> > ___
> > Gluster-devel mailing list
> > gluster-de...@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> > 
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-infra


Re: [Gluster-infra] [Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-19 Thread Nithya Balachandran
On 19 February 2018 at 13:12, Atin Mukherjee  wrote:

>
>
> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu  wrote:
>
>> Hello,
>>
>> As you all most likely know, we store the tarball of the binaries and
>> core if there's a core during regression. Occasionally, we've introduced a
>> bug in Gluster and this tar can take up a lot of space. This has happened
>> recently with brick multiplex tests. The build-install tar takes up 25G,
>> causing the machine to run out of space and continuously fail.
>>
>
> AFAIK, we don't have a .t file in the upstream regression suites where
> hundreds of volumes are created. With that scale and brick multiplexing
> enabled, I can understand that the core would be quite heavy and may
> consume this much space. FWIW, can we first try to figure out which test
> was causing this crash, and see whether running a gcore after certain
> steps in the test leaves us with a core file of a similar size? IOW, have
> we actually seen core files of such a huge size generated earlier? If not,
> what changed that we've started seeing this is something to be
> investigated.
>

We also need to check if this is only the core file that is causing the
increase in size or whether there is something else that is taking up a lot
of space.
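Before deciding what to delete, it may be worth looking at where the space is actually going on an affected builder; a quick check, using the /archive path from Nigel's mail:

    # What is filling the disk, and how much of it is cores vs. tarballs?
    df -h /
    du -xsh /archive/* 2>/dev/null | sort -rh | head -20
    ls -lh /archive/*.tar* 2>/dev/null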

>
>
>>
>> I've made some changes this morning. Right after we create the tarball,
>> we'll delete all files in /archive that are greater than 1G. Please be
>> aware that this means all large files, including the newly created
>> tarball, will be deleted. You will have to work with the traceback on the
>> Jenkins job.
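For reference, the kind of cleanup described above could be as simple as the following; this is an illustration, not the actual Jenkins change:

    # Right after creating the tarball, drop anything in /archive over 1G.
    find /archive -type f -size +1G -print -delete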
>>
>
> We'd really need to first investigate the average size of the core file we
> can expect when a system is running with brick multiplexing and ongoing
> I/O. Without that, immediately deleting core files > 1G will cause trouble
> for developers debugging genuine crashes, as the traceback alone may not
> be sufficient.
>
>
>>
>>
>>
>> --
>> nigelb
>>
>> ___
>> Gluster-devel mailing list
>> gluster-de...@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] bug-1432542-mpx-restart-crash.t failing

2018-07-09 Thread Nithya Balachandran
We discussed reducing the number of volumes in the maintainers'
meeting. Should we still go ahead and do that?


On 9 July 2018 at 15:45, Xavi Hernandez  wrote:

> On Mon, Jul 9, 2018 at 11:14 AM Karthik Subrahmanya 
> wrote:
>
>> Hi Deepshikha,
>>
>> Are you looking into this failure? I can still see this happening for all
>> the regression runs.
>>
>
> I've executed the failing script on my laptop and all tests finish
> relatively fast. What seems to take time is the final cleanup. I can see
> 'semanage' taking some CPU during destruction of volumes. The test required
> 350 seconds to finish successfully.
>
> Not sure what caused the cleanup time to increase, but I've created a bug
> [1] to track this and a patch [2] to give more time to this test. This
> should allow all blocked regressions to complete successfully.
>
> Xavi
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1599250
> [2] https://review.gluster.org/20482
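The patch itself isn't quoted here, but the regression log further down in this thread shows "Timeout set is 300, default 200", which suggests the harness honours a per-test override; assuming that override is a SCRIPT_TIMEOUT line in the .t file, bumping it would look roughly like:

    # Illustration only; the real change is in review 20482.
    grep -n 'SCRIPT_TIMEOUT' tests/bugs/core/bug-1432542-mpx-restart-crash.t
    # Raise the per-test timeout, e.g. from 300 to 400 seconds, to absorb the
    # slower cleanup:
    sed -i 's/^SCRIPT_TIMEOUT=.*/SCRIPT_TIMEOUT=400/' \
        tests/bugs/core/bug-1432542-mpx-restart-crash.t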
>
>
>> Thanks & Regards,
>> Karthik
>>
>> On Sun, Jul 8, 2018 at 7:18 AM Atin Mukherjee 
>> wrote:
>>
>>> https://build.gluster.org/job/regression-test-with-multiplex/794/display/redirect
>>> has the same test failing. Is the reason for the failure different, given
>>> this is on Jenkins?
>>>
>>> On Sat, 7 Jul 2018 at 19:12, Deepshikha Khandelwal 
>>> wrote:
>>>
 Hi folks,

 The issue [1] has been resolved. Softserve instances will now have 2GB
 RAM, i.e. the same as the Jenkins builders' sizing configuration.

 [1] https://github.com/gluster/softserve/issues/40

 Thanks,
 Deepshikha Khandelwal

 On Fri, Jul 6, 2018 at 6:14 PM, Karthik Subrahmanya <
 ksubr...@redhat.com> wrote:
 >
 >
 > On Fri 6 Jul, 2018, 5:18 PM Deepshikha Khandelwal, <
 dkhan...@redhat.com>
 > wrote:
 >>
 >> Hi Poornima/Karthik,
 >>
 >> We've looked into the memory error that this softserve instance showed.
 >> These machine instances have 1GB RAM, which is not the case with the
 >> Jenkins builders; those have 2GB.
 >>
 >> We've created issue [1] and will solve it soon.
 >
 > Great. Thanks for the update.
 >>
 >>
 >> Sorry for the inconvenience.
 >>
 >> [1] https://github.com/gluster/softserve/issues/40
 >>
 >> Thanks,
 >> Deepshikha Khandelwal
 >>
 >> On Fri, Jul 6, 2018 at 3:44 PM, Karthik Subrahmanya <
 ksubr...@redhat.com>
 >> wrote:
 >> > Thanks Poornima for the analysis.
 >> > Can someone work on fixing this please?
 >> >
 >> > ~Karthik
 >> >
 >> > On Fri, Jul 6, 2018 at 3:17 PM Poornima Gurusiddaiah
 >> > 
 >> > wrote:
 >> >>
 >> >> The same test case is failing for my patch as well [1]. I requested a
 >> >> regression system and tried to reproduce it.
 >> >> From my analysis, the brick process (multiplexed) is consuming a lot
 >> >> of memory and is being OOM killed. The regression machine has 1GB RAM
 >> >> and the process is consuming more than 1GB. 1GB for 120 bricks is
 >> >> acceptable considering there are ~1000 threads in that brick process.
 >> >> Ways to fix:
 >> >> - Increase the regression system RAM size, OR
 >> >> - Decrease the number of volumes in the test case.
 >> >>
 >> >> What is strange, though, is why the test sometimes passes for some
 >> >> patches. There could be some bug in memory consumption.
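If the OOM-kill theory needs confirming on a borrowed regression VM, the kernel log records the victim and its memory use when the killer fires; a quick check:

    # Did the OOM killer fire, and on what?
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
    # Live memory and thread count of the multiplexed brick process while the
    # test is running:
    ps -o pid,rss,vsz,nlwp,cmd -C glusterfsd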
 >> >>
 >> >> Regards,
 >> >> Poornima
 >> >>
 >> >>
 >> >> On Fri, Jul 6, 2018 at 2:11 PM, Karthik Subrahmanya
 >> >> 
 >> >> wrote:
 >> >>>
 >> >>> Hi,
 >> >>>
 >> >>> $subject is failing on centos regression for most of the patches
 >> >>> with a timeout error.
 >> >>>
 >> >>> 07:32:34
 >> >>> 07:32:34 ================================================================================
 >> >>> 07:32:34 [07:33:05] Running tests in file ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
 >> >>> 07:32:34 Timeout set is 300, default 200
 >> >>> 07:37:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t timed out after 300 seconds
 >> >>> 07:37:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t: bad status 124
 >> >>> 07:37:34
 >> >>> 07:37:34 *********************************
 >> >>> 07:37:34 *       REGRESSION FAILED       *
 >> >>> 07:37:34 * Retrying failed tests in case *
 >> >>> 07:37:34 * we got some spurious failures *
 >> >>> 07:37:34 *********************************
 >> >>> 07:37:34
 >> >>> 07:42:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t timed out after 300 seconds
 >> >>> 07:42:34 End of test ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
 >> >>> 07:42:34 ================================================================================

Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-05 Thread Nithya Balachandran
On Fri, 5 Apr 2019 at 12:16, Michael Scherer  wrote:

> Le jeudi 04 avril 2019 à 18:24 +0200, Michael Scherer a écrit :
> > Le jeudi 04 avril 2019 à 19:10 +0300, Yaniv Kaul a écrit :
> > > I'm not convinced this is solved. Just had what I believe is a
> > > similar
> > > failure:
> > >
> > > 00:12:02.532  A dependency job for rpc-statd.service failed. See
> > > 'journalctl -xe' for details.
> > > 00:12:02.532  mount.nfs: rpc.statd is not running but is required for
> > > remote locking.
> > > 00:12:02.532  mount.nfs: Either use '-o nolock' to keep locks local, or
> > > start statd.
> > > 00:12:02.532  mount.nfs: an incorrect mount option was specified
> > >
> > > (of course, it can always be my patch!)
> > >
> > > (of course, it can always be my patch!)
> > >
> > > https://build.gluster.org/job/centos7-regression/5384/console
> >
> > Same issue, different builder (206). I will check them all, as the
> > issue is more widespread than I expected (or it popped up since the last
> > time I checked).
>
> Deepshika did notice that the issue came back on one server (builder202)
> after a reboot, so the rpcbind issue is not related to the network
> initscript one; the RCA continues.
>
> We are looking at another workaround involving fiddling with the socket
> (until we find out why it uses ipv6 at boot, but not afterwards, when
> ipv6 is disabled).
>

Could this be relevant?
https://access.redhat.com/solutions/2798411
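That solution describes rpcbind.socket failing to start when IPv6 is disabled because the stock unit also listens on [::]:111. Assuming the builders hit the same thing, a minimal workaround sketch would be a drop-in that keeps only the IPv4 and local listeners; the listener list below is the CentOS 7 default and should be checked against 'systemctl cat rpcbind.socket' first:

    # Override rpcbind.socket to stop listening on IPv6.
    mkdir -p /etc/systemd/system/rpcbind.socket.d
    printf '%s\n' '[Socket]' \
        'ListenStream=' 'ListenDatagram=' \
        'ListenStream=/var/run/rpcbind.sock' \
        'ListenStream=0.0.0.0:111' 'ListenDatagram=0.0.0.0:111' \
        > /etc/systemd/system/rpcbind.socket.d/no-ipv6.conf
    systemctl daemon-reload
    systemctl restart rpcbind.socket rpcbind.service rpc-statd.service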


>
> Maybe we could run the test suite on a node without all the ipv6
> disabling to see if that causes an issue?
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra