Re: [Gluster-infra] [Gluster-devel] Update on georep failure

2021-02-02 Thread Yaniv Kaul
On Tue, Feb 2, 2021 at 8:14 PM Michael Scherer  wrote:

> Hi,
>
> so we finally found the cause of the georep failure, after several days
> of work from Deepshika and me.
>
> Short story:
> ===========
>
> side effect of adding libtirpc-devel on EL 7:
> https://github.com/gluster/project-infrastructure/issues/115


Looking at
https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/191 - we
weren't supposed to use it?
From
https://github.com/gluster/glusterfs/blob/d1d7a6f35c816822fab51c820e25023863c239c1/glusterfs.spec.in#L61
:
# Do not use libtirpc on EL6, it does not have xdr_uint64_t() and
xdr_uint32_t
# Do not use libtirpc on EL7, it does not have xdr_sizeof()
%if ( 0%{?rhel} && 0%{?rhel} <= 7 )
%global _without_libtirpc --without-libtirpc
%endif


CentOS 7 has an ancient version, CentOS 8 has a newer version, so perhaps
just use CentOS 8 slaves?
Y.
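
As a side note, a quick way to check whether a given builder's gluster
binaries actually ended up linked against libtirpc would be something like
the sketch below (the binary path is illustrative and depends on the
install prefix):

    # Did the build pick up libtirpc (directly or via libgfrpc),
    # or the in-tree XDR code?
    ldd /usr/sbin/glusterfsd | grep -i tirpc

    # Is the -devel package that flips the configure result present?
    rpm -q libtirpc libtirpc-devel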


>
> Long story:
> ==========
>
> So we were first puzzled about why it was failing on just some builders and
> not others, especially since it was working fine on softserve VMs.
>
> We looked for the usual suspects: we rebooted, reinstalled, and searched
> for anything weird (too many ssh keys, not enough inodes, some hardware
> issue), but found nothing obvious.
>
> After trying to find my way through the log files and chasing a few weird
> leads (like: why was gsyncd running gcc? answer: ctypes), I was left with
> a rather cryptic message:
>
> [2021-02-02 15:19:00.040817 +] I
> [socket.c:929:__socket_server_bind] 0-socket.gfchangelog: closing
> (AF_UNIX) reuse check socket 18
> [2021-02-02 15:19:02.041641 +] W [xdr-
> rpcclnt.c:68:rpc_request_to_xdr] 0-rpc: failed to encode call msg
> [2021-02-02 15:19:02.041673 +] E [rpc-
> clnt.c:1507:rpc_clnt_record_build_record] 0-gfchangelog: Failed to
> build record header
> [2021-02-02 15:19:02.041683 +] W [rpc-clnt.c:1664:rpc_clnt_submit]
> 0-gfchangelog: cannot build rpc-record
> [2021-02-02 15:19:02.041692 +] E [MSGID: 132023] [gf-
> changelog.c:285:gf_changelog_setup_rpc] 0-gfchangelog: Could not
> initiate probe RPC, bailing out!!!
> [2021-02-02 15:19:02.041809 +] E [MSGID: 132022] [gf-
> changelog.c:583:gf_changelog_register_generic] 0-gfchangelog: Error
> registering with changelog xlator
>
> Given that all of Gluster is built around RPC, it seemed unlikely that RPC
> itself was broken, but those were the only messages we had.
>
>
> We also found that the only builder that was working was builder 210.
> Upon looking, we found that 210 failed to be updated with ansible, due
> to some debugging we forgot to revert, which made this task fail:
>
> https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/gluster_qa_scripts/tasks/main.yml#L7
>
> But it wasn't clear how that would change anything, since the only diff
> was a "set -e" that wasn't removed.
>
> Then Deepshika started to test more than georep, and she noticed that a
> lot of other tests were failing, with the exact same message about
> rpc.
>
> And she started to wonder if anything was recently changed. And indeed:
>
> # rpm -qa --last | head -n 15
> yum-plugin-auto-update-debug-info-1.1.31-54.el7_8.noarch mar. 02 févr.
> 2021 14:04:59 UTC
> python3-debuginfo-3.6.8-18.el7.x86_64 mar. 02 févr. 2021
> 14:04:58 UTC
> glibc-debuginfo-2.17-317.el7.x86_64   mar. 02 févr. 2021
> 14:04:57 UTC
> glibc-debuginfo-common-2.17-317.el7.x86_64mar. 02 févr. 2021
> 14:04:53 UTC
> gpg-pubkey-b6792c39-53c4fbdd  mar. 02 févr. 2021
> 14:04:34 UTC
> tzdata-java-2021a-1.el7.noarchmer. 27 janv. 2021
> 09:09:27 UTC
> tzdata-2021a-1.el7.noarch mer. 27 janv. 2021
> 09:09:26 UTC
> sudo-1.8.23-10.el7_9.1.x86_64 mer. 27 janv. 2021
> 09:09:26 UTC
> libtirpc-devel-0.2.4-0.16.el7.x86_64  mar. 26 janv. 2021
> 12:53:45 UTC
> java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64 mar. 26 janv. 2021
> 05:06:44 UTC
>
> We added libtirpc-devel on 26/01.
>
> libtirpc-devel would, as the name implies, change something around the
> rpc subsystem.
>
> That happened around last week, which is when we started to notice the problem.
>
> It was not applied to 210, because 210 failed before it got to that
> point (since ansible stops as soon as the git update fails, and the jenkins
> builder role runs after the gluster-qa-script update).
>
> It was not applied to softserve-provided VMs either, so tests were
> working fine there.
>
> And indeed, once the package got removed, the tests were working again.
>
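
For reference, the immediate fix on an affected EL7 builder would
presumably look like the sketch below; the yum exclude line is only an
illustration of how one might keep a nightly "yum update" from pulling the
package back in, not necessarily what was actually deployed:

    # remove the package that flipped gluster's RPC code path
    yum remove -y libtirpc-devel

    # optionally keep nightly updates from reinstalling it (illustrative)
    echo "exclude=libtirpc-devel" >> /etc/yum.conf
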
> Follow up
> =========
>
> So, I would like to know exactly what should be tested. Is gluster not
> compatible with libtirpc on C7 (while it works on C8), or is there some
> weird issue? (Because from what I remember, the RPC format is supposed to
> be compatible and covered by a specification.)
>
> Should we test on C8 only?
>
>
> --
> Michael Scherer / He/Il/Er/Él
> Sysadmin, Community Infrastructure
>
>
>
> ---
>
> Community Meeting Calendar:
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 

Re: [Gluster-infra] Infra & Tests: Few requests

2020-01-03 Thread Yaniv Kaul
On Fri, Jan 3, 2020 at 4:07 PM Amar Tumballi  wrote:

> Hi Team,
>
> First things first - Happy 2020!! Hope this year will be great for all of
> us :-)
>
> A few requests to begin the new year!
>
> 1. Let's please move all the fedora builders to F31.
>    - There can be some warnings with F31, so we can start in 'skip' mode
> and, once fixed, enable them to vote.
>

I use F31 and things seem OK.
Probably worth adding / moving to CentOS 8 (streams?!).


> 2. Failures in smoke due to devrpm-el.
>    - I was not able to get the info just by looking at the console logs and
> other things. It is not the previous glitch of the 'Build root locked by
> another process' error.
>    - It would be great to get this resolved, so we can merge some good
> patches.
>

It's being worked on.

>
> 3. Random failures in centos-regression.
>    - Again, I am not sure if someone is looking into this.
>    - I have noticed tests like
> './tests/basic/distribute/rebal-all-nodes-migrate.t'
> etc. failing on a few machines.
>

Same, I think.
I know a patch of mine broke CI - but it was reverted - do you still see
regular failures, or just random ones? Per node?
Y.

>
> Thanks in advance!
>
> Regards,
> Amar
>
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Something wrong with the Jenkins slaves for CentOS?

2019-10-22 Thread Yaniv Kaul
Looking at the CentOS regression job, I see several of these errors:
(pending—‘Jenkins’ doesn’t have label ‘centos7’; ‘
bugziller.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder0.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder1.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder10.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder11.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder12.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder13.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder14.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder15.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder16.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder17.int.rht.gluster.org’ doesn’t have label ‘centos7’;
builder18.int.rht.gluster.org is offline; builder19.int.rht.gluster.org is
offline; ‘builder2.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder20.int.rht.gluster.org’ doesn’t have label ‘centos7’;
builder201.aws.gluster.org is offline; builder204.aws.gluster.org is offline;
builder206.aws.gluster.org is offline; builder207.aws.gluster.org is offline;
builder208.aws.gluster.org is offline;
‘builder21.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder22.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder23.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder24.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder25.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder26.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder27.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder28.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder29.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder3.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder30.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder31.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder32.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder33.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder34.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder35.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder36.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder37.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder38.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder39.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder4.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder40.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder41.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder42.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder43.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder44.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder45.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder46.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder47.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder48.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder49.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder5.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder50.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder6.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder7.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder8.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
builder9.int.rht.gluster.org’ doesn’t have label ‘centos7’; ‘
freebsd10.3.int.rht.gluster.org’ doesn’t have label ‘centos7’)

#8239
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-06-04 Thread Yaniv Kaul
What was the result of this investigation? I suspect I'm seeing the same
issue on builder209 [1].
Y.

[1] https://build.gluster.org/job/centos7-regression/6302/consoleFull

On Fri, Apr 5, 2019 at 5:40 PM Michael Scherer  wrote:

> On Friday, April 5, 2019 at 16:55 +0530, Nithya Balachandran wrote:
> > On Fri, 5 Apr 2019 at 12:16, Michael Scherer 
> > wrote:
> >
> > > On Thursday, April 4, 2019 at 18:24 +0200, Michael Scherer wrote:
> > > > On Thursday, April 4, 2019 at 19:10 +0300, Yaniv Kaul wrote:
> > > > > I'm not convinced this is solved. Just had what I believe is a
> > > > > similar
> > > > > failure:
> > > > >
> > > > > *00:12:02.532* A dependency job for rpc-statd.service failed.
> > > > > See
> > > > > 'journalctl -xe' for details.*00:12:02.532* mount.nfs:
> > > > > rpc.statd is
> > > > > not running but is required for remote locking.*00:12:02.532*
> > > > > mount.nfs: Either use '-o nolock' to keep locks local, or start
> > > > > statd.*00:12:02.532* mount.nfs: an incorrect mount option was
> > > > > specified
> > > > >
> > > > > (of course, it can always be my patch!)
> > > > >
> > > > > https://build.gluster.org/job/centos7-regression/5384/console
> > > >
> > > > same issue, different builder (206). I will check them all, as
> > > > the
> > > > issue is more widespread than I expected (or it did popup since
> > > > last
> > > > time I checked).
> > >
> > > Deepshika did notice that the issue came back on one server
> > > (builder202) after a reboot, so the rpcbind issue is not related to
> > > the
> > > network initscript one, so the RCA continue.
> > >
> > > We are looking for another workaround involving fiddling with the
> > > socket (until we find why it do use ipv6 at boot, but not after,
> > > when
> > > ipv6 is disabled).
> > >
> >
> > Could this be relevant?
> > https://access.redhat.com/solutions/2798411
>
> Good catch.
>
> So, we already do that; Nigel took care of it (after 2 days of
> research). But I didn't know the exact symptoms, and decided to double
> check just in case.
>
> And... there is no sysctl.conf in the initrd. Running dracut -v -f does
> not change anything.
>
> Running "dracut -v -f -H" takes care of that (and this fixes the problem),
> but:
> - our ansible script already runs that
> - -H is hostonly, which is already the default on EL7 according to the
> docs.
>
> However, if dracut-config-generic is installed, it doesn't build a
> hostonly initrd, and so does not include the sysctl.conf file (which breaks
> rpcbind, which breaks the test suite).
>
> And for some reason, it is installed in the image in EC2 (likely by default),
> but not by default on the builders.
>
> So what happens is that after a kernel upgrade, dracut rebuilds a generic
> initrd instead of a hostonly one, which breaks things. And the kernel was
> likely upgraded recently (upgrades happen nightly, for some value of
> "night"), so we didn't see that earlier, nor with a fresh system.
>
>
> So now, we have several solutions:
> - be explicit about using hostonly in dracut, so this doesn't happen again
> (or not for this reason); see the sketch after this list
>
> - disable ipv6 in rpcbind in a cleaner way (to be tested)
>
> - get the test suite to work with IPv6
>
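
A minimal sketch of the first option, making hostonly explicit so a kernel
update cannot silently rebuild a generic initrd without sysctl.conf
(assuming EL7 dracut; the drop-in file name is illustrative):

    # force hostonly images even with dracut-config-generic installed
    echo 'hostonly="yes"' > /etc/dracut.conf.d/99-hostonly.conf

    # rebuild the initrd for the running kernel
    dracut -f -v

    # confirm sysctl.conf made it into the image
    lsinitrd /boot/initramfs-$(uname -r).img | grep sysctl.conf
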
> In the long term, I also want to monitor the processes, but for that I
> need a VPN between the nagios server and EC2, and that project got
> blocked by several issues (like EC2 not supporting ecdsa keys, and we use
> those for ansible, so we have to go back to RSA for fully automated
> deployment, and openvpn requires certificates, so I need a newer
> python-openssl for what I want, and RHEL 7 is too old, etc., etc.).
>
> As the weekend approaches for me, I just rebuilt the initrd for the time
> being. I guess forcing hostonly is the safest fix for now, but that will
> be for Monday.
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

[Gluster-infra] Do we have a monitoring system on our builders?

2019-04-27 Thread Yaniv Kaul
I'd like to see what our status is.
I just had CI failures[1] because builder26.int.rht.gluster.org is not
available, apparently.

TIA,
Y.

[1] https://build.gluster.org/job/devrpm-el7/15846/console
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-24 Thread Yaniv Kaul
On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer  wrote:

> On Monday, April 22, 2019 at 22:57 +0530, Atin Mukherjee wrote:
> > Is this back again? The recent patches are failing regression :-\ .
>
> So, on builder206, it took me a while to find that the issue is that
> nfs (the service) was running.
>
> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
> failed with a rather cryptic message:
>
> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
> socket.nfs-server: process started listening on port (38465)
> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
> socket.nfs-server: binding to  failed: Address already in use
> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
> socket.nfs-server: Port is already in use
> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
> socket.nfs-server: __socket_server_bind failed;closing socket 14
>
> I found where this came from, but a few things did surprise me:
>
> - the order of the prints is different from the order in the code
>

Indeed strange...

> - the message about "started listening" didn't take into account the fact
> that bind failed:
>

Shouldn't it bail out if it failed to bind?
Some missing 'goto out' around line 975/976?
Y.

>
>
>
> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>
> The message about port 38465 also threw me off track. The real
> issue is that the nfs service was already running, and I couldn't find
> anything listening on port 38465.
>
> Once I did "service nfs stop", it no longer failed.
>
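
For the record, those checks boil down to roughly the following on an EL7
builder (a sketch; the NFS server unit name may be nfs or nfs-server
depending on the setup):

    # is the kernel NFS server running and holding the RPC registrations?
    systemctl status nfs-server
    rpcinfo -p | grep -E 'nfs|mountd'

    # is anything actually bound to the port from the error message?
    ss -tlnp | grep 38465

    # the change that made the tests pass again
    service nfs stop
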
> So far, I do not know why nfs.service was activated.
>
> But at least 206 should be fixed, and we know a bit more about what could
> be causing some failures.
>
>
>
> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer 
> > wrote:
> >
> > > On Wednesday, April 3, 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <
> > > > jthot...@redhat.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > is_nfs_export_available is just a wrapper around "showmount"
> > > > > command AFAIR.
> > > > > I saw following messages in console output.
> > > > >  mount.nfs: rpc.statd is not running but is required for remote
> > > > > locking.
> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local,
> > > > > or
> > > > > start
> > > > > statd.
> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > > > >
> > > > > For me it looks rpcbind may not be running on the machine.
> > > > > Usually rpcbind starts automatically on machines, don't know
> > > > > whether it
> > > > > can happen or not.
> > > > >
> > > >
> > > > That's precisely what the question is. Why suddenly we're seeing
> > > > this
> > > > happening too frequently. Today I saw atleast 4 to 5 such
> > > > failures
> > > > already.
> > > >
> > > > Deepshika - Can you please help in inspecting this?
> > >
> > > So we think (we are not sure) that the issue is a bit complex.
> > >
> > > What we were investigating was nightly runs failing on AWS. When the build
> > > crashes, the builder is restarted, since that's the easiest way to clean
> > > everything (even with a perfect test suite that cleaned up after itself,
> > > we could always end up in a corrupt state on the system, WRT mounts, fs,
> > > etc.).
> > >
> > > In turn, this seems to cause trouble on AWS, since cloud-init or
> > > something renames the eth0 interface to ens5, without cleaning up the
> > > network configuration.
> > >
> > > So the network init script fails (because the image says "start eth0" and
> > > that interface is not present), but fails in a weird way. The network is
> > > initialised and working (we can connect), but the dhclient process is not
> > > in the right cgroup, and network.service is in a failed state. Restarting
> > > the network didn't work. In turn, this means that rpc-statd refuses to
> > > start (due to systemd dependencies), which seems to impact various NFS
> > > tests.
> > >
> > > We have also seen that on some builders, rpcbind picks up some IPv6
> > > autoconfiguration, but we can't reproduce that, and there is no IPv6
> > > set up anywhere. I suspect the network.service failure is somehow
> > > involved, but I fail to see how. In turn, rpcbind.socket not starting
> > > could cause NFS test troubles.
> > >
> > > Our current stop-gap fix was to fix all the builders one by one:
> > > remove the config, kill the rogue dhclient, restart the network service.
> > >
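
In practice that stop-gap amounts to roughly the following on each affected
builder (a sketch; the exact config file and unit names depend on what
cloud-init left behind):

    # drop the stale eth0 config that no longer matches the renamed interface
    rm -f /etc/sysconfig/network-scripts/ifcfg-eth0

    # kill the dhclient that ended up outside its cgroup
    pkill dhclient

    # bring networking and the rpc services back to a clean state
    systemctl restart network
    systemctl restart rpcbind.socket rpcbind.service rpc-statd
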
> > > However, we can't be sure this is going to fix the problem long term,
> > > since this only manifests after a crash of the test suite, and that
> > > doesn't happen so often. (Plus, it was working before some day in the
> > > past, when something made this start failing, and I do not know if that
> > > was a system upgrade, or a test change, or both.)
> > >
> > > So we are still looking at it to have a complete understanding of
> > > the
> > > 

Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-05 Thread Yaniv Kaul
On Fri, Apr 5, 2019 at 9:55 AM Deepshikha Khandelwal 
wrote:

>
>
> On Fri, Apr 5, 2019 at 12:16 PM Michael Scherer 
> wrote:
>
>> On Thursday, April 4, 2019 at 18:24 +0200, Michael Scherer wrote:
>> > On Thursday, April 4, 2019 at 19:10 +0300, Yaniv Kaul wrote:
>> > > I'm not convinced this is solved. Just had what I believe is a
>> > > similar
>> > > failure:
>> > >
>> > > *00:12:02.532* A dependency job for rpc-statd.service failed. See
>> > > 'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is
>> > > not running but is required for remote locking.*00:12:02.532*
>> > > mount.nfs: Either use '-o nolock' to keep locks local, or start
>> > > statd.*00:12:02.532* mount.nfs: an incorrect mount option was
>> > > specified
>> > >
>> > > (of course, it can always be my patch!)
>> > >
>> > > https://build.gluster.org/job/centos7-regression/5384/console
>> >
>> > same issue, different builder (206). I will check them all, as the
>> > issue is more widespread than I expected (or it did popup since last
>> > time I checked).
>>
>> Deepshika did notice that the issue came back on one server
>> (builder202) after a reboot, so the rpcbind issue is not related to the
>> network initscript one, so the RCA continue.
>>
>> We are looking for another workaround involving fiddling with the
>> socket (until we find why it do use ipv6 at boot, but not after, when
>> ipv6 is disabled).
>>
>> Maybe we could run the test suite on a node without all the ipv6
>> disabling, to see if that causes an issue?
>>
> Has our regression test suite started supporting ipv6 now? Otherwise this
> investigation would lead to further issues.
>

I suspect not yet. But we would certainly like it to, at some point, to ensure
we run with IPv6 as well!
Y.

> --
>> Michael Scherer
>> Sysadmin, Community Infrastructure and Platform, OSAS
>>
>>
>> ___
>> Gluster-infra mailing list
>> Gluster-infra@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-infra
>>
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Yaniv Kaul
I'm not convinced this is solved. Just had what I believe is a similar
failure:

*00:12:02.532* A dependency job for rpc-statd.service failed. See
'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is
not running but is required for remote locking.*00:12:02.532*
mount.nfs: Either use '-o nolock' to keep locks local, or start
statd.*00:12:02.532* mount.nfs: an incorrect mount option was
specified

(of course, it can always be my patch!)

https://build.gluster.org/job/centos7-regression/5384/console


On Thu, Apr 4, 2019 at 6:56 PM Atin Mukherjee  wrote:

> Thanks misc. I have always seen a pattern that on a reattempt (recheck
> centos) the same builder is picked up many times, even though it's supposed
> to pick up the builders in a round-robin manner.
>
> On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer 
> wrote:
>
>> On Thursday, April 4, 2019 at 15:19 +0200, Michael Scherer wrote:
>> > On Thursday, April 4, 2019 at 13:53 +0200, Michael Scherer wrote:
>> > > On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
>> > > > Based on what I have seen, any multi-node test case will fail, and the
>> > > > above one is picked first from that group, and if I am correct none of
>> > > > the code fixes will go through regression until this is fixed. I
>> > > > suspect it to be an infra issue again. If we look at
>> > > > https://review.gluster.org/#/c/glusterfs/+/22501/ &
>> > > > https://build.gluster.org/job/centos7-regression/5382/, peer
>> > > > handshaking is stuck as 127.1.1.1 is unable to receive a response
>> > > > back; did we end up having firewall and other n/w settings screwed up?
>> > > > The test never fails locally.
>> > >
>> > > The firewall didn't change, and has had the line
>> > > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
>> > > localhost interface works. (I am not even sure that netfilter does
>> > > anything meaningful on the loopback interface, but maybe I am wrong, and
>> > > I'm not keen on digging through kernel code for that.)
>> > >
>> > >
>> > > Ping seems to work fine as well, so we can exclude a routing issue.
>> > >
>> > > Maybe we should look at the socket: does it listen on a specific
>> > > address or not?
>> >
>> > So, I did look at the first 20 failures, removed all those not related to
>> > rebal-all-nodes-migrate.t, and saw that all were run on builder203, which
>> > was freshly reinstalled. As Deepshika noticed today, this one had an issue
>> > with ipv6, the 2nd issue we were tracking.
>> >
>> > Summary: the rpcbind.socket systemd unit listens on ipv6 despite ipv6
>> > being disabled, and the fix is to reload systemd. We have so far no idea
>> > why it happens, but suspect this might be related to the network issue
>> > we did identify, as it happens only after a reboot, which happens only
>> > if a build is cancelled/crashed/aborted.
>> >
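
The workaround itself is small; roughly what "reload systemd" amounts to
here (a sketch, not necessarily the exact commands that were run):

    # rpcbind.socket holds an IPv6 listener even though IPv6 is disabled
    ss -lnp | grep ':111 '

    # reload unit definitions and restart the socket so it re-reads the state
    systemctl daemon-reload
    systemctl restart rpcbind.socket rpcbind.service
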
>> > I applied the workaround on builder203, so if the culprit is that
>> > specific issue, I guess that's fixed.
>> >
>> > I started a test to see how it goes:
>> > https://build.gluster.org/job/centos7-regression/5383/
>>
>> The test did just pass, so I would assume the problem was local to
>> builder203. Not sure why it was always selected, except that this
>> was the only one that failed, so it was always free to pick up new jobs.
>>
>> Maybe we should increase the number of builders so this doesn't happen,
>> as I guess the other builders were busy at that time?
>>
>> --
>> Michael Scherer
>> Sysadmin, Community Infrastructure and Platform, OSAS
>>
>>
>> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-03 Thread Yaniv Kaul
On Wed, Apr 3, 2019 at 2:53 PM Michael Scherer  wrote:

> On Wednesday, April 3, 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan 
> > wrote:
> >
> > > Hi,
> > >
> > > is_nfs_export_available is just a wrapper around "showmount"
> > > command AFAIR.
> > > I saw following messages in console output.
> > >  mount.nfs: rpc.statd is not running but is required for remote
> > > locking.
> > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or
> > > start
> > > statd.
> > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > >
> > > For me it looks rpcbind may not be running on the machine.
> > > Usually rpcbind starts automatically on machines, don't know
> > > whether it
> > > can happen or not.
> > >
> >
> > That's precisely what the question is. Why suddenly we're seeing this
> > happening too frequently. Today I saw atleast 4 to 5 such failures
> > already.
> >
> > Deepshika - Can you please help in inspecting this?
>
> So in the past, this kind of stuff did happen with ipv6, so this could
> be a change on AWS and/or an upgrade.
>

We need to enable IPv6, for two reasons:
1. IPv6 is common these days, even if we don't test with it, it should be
there.
2. We should test with IPv6...

I'm not sure, but I suspect we do disable IPv6 here and there. Example[1].
Y.

[1]
https://github.com/gluster/centosci/blob/master/jobs/scripts/glusto/setup-glusto.yml
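
A quick way to see whether (and where) IPv6 is disabled on a given builder,
as a sketch rather than the exact mechanism the CI images use:

    sysctl net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.default.disable_ipv6
    grep -r ipv6 /etc/sysctl.conf /etc/sysctl.d/ 2>/dev/null
    grep -o 'ipv6.disable=[01]' /proc/cmdline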

>
> We are currently investigating a set of failures that happen after
> reboot (resulting in partial network bring-up, causing all kinds of
> weird issues), but it takes some time to verify, and since we lost 33%
> of the team with Nigel's departure, stuff does not move as fast as before.
>
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream?

2019-04-01 Thread Yaniv Kaul
On Mon, Apr 1, 2019 at 8:56 AM Vijay Bhaskar Reddy Avuthu <
vavu...@redhat.com> wrote:

>
>
> On Sun, Mar 31, 2019 at 6:30 PM Yaniv Kaul  wrote:
>
>>
>>
>> On Wed, Mar 13, 2019 at 4:14 PM Jonathan Holloway 
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 13, 2019 at 5:08 AM Sankarshan Mukhopadhyay <
>>> sankarshan.mukhopadh...@gmail.com> wrote:
>>>
>>>> On Wed, Mar 13, 2019 at 3:03 PM Yaniv Kaul  wrote:
>>>> > On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay <
>>>> sankarshan.mukhopadh...@gmail.com> wrote:
>>>> >>
>>>> >> What I am essentially looking to understand is whether there are
>>>> >> regular Glusto runs and whether the tests receive refreshes. However,
>>>> >> if there is no available Glusto service running upstream - that is a
>>>> >> whole new conversation.
>>>> >
>>>> >
>>>> > I'm* still trying to get it running properly on my simple
>>>> Vagrant+Ansible setup[1].
>>>> > Right now I'm installing Gluster + Glusto + creating bricks, pool and
>>>> a volume in ~3m on my latop.
>>>> >
>>>>
>>>> This is good. I think my original question was to the maintainer(s) of
>>>> Glusto along with the individuals involved in the automated testing
>>>> part of Gluster to understand the challenges in deploying this for the
>>>> project.
>>>>
>>>> > Once I do get it fully working, we'll get to make it work faster,
>>>> clean it up and and see how can we get code coverage.
>>>> >
>>>> > Unless there's an alternative to the whole framework that I'm not
>>>> aware of?
>>>>
>>>> I haven't read anything to this effect on any list.
>>>>
>>>>
>>> This is cool. I haven't had a chance to give it a run on my laptop, but
>>> it looked good.
>>> Are you running into issues with Glusto, glusterlibs, and/or
>>> Glusto-tests?
>>>
>>
>> All of the above.
>> - The client consumes at times 100% CPU, not sure why.
>> - There are missing deps which I'm reverse engineering from Gluster CI
>> (which by itself has some strange deps - why do we need python-docx ?)
>> - I'm failing with the cvt test, with
>> test_shrinking_volume_when_io_in_progress with the error:
>> AssertionError: IO failed on some of the clients
>>
>> I had hoped it could give me a bit more hint:
>> - which clients? (I happen to have one, so that's easy)
>> - What IO workload?
>> - What error?
>>
>> - I hope there's a mode that does NOT perform cleanup/teardown, so it's
>> easier to look at the issue at hand.
>>
>
> python-docx needs to be installed as part of "glusto-tests dependencies".
> file_dir_ops.py supports writing docx files.
>

Anything special about docx files that we need to test with it? Have we
ever had some corruption specifically there? I'd understand (sort-of) if we
were running some application on top.
Anyway, this is not it - I have it installed already.

> IO failed on the client 192.168.250.10, and it was trying to write deep
> directories with files.
> You need to comment out the "tearDown" section if we want to leave the
> cluster as it is, in the failed state.
>

I would say that in CI, we probably want to continue, and elsewhere, we
probably want to stop.

>
>
>>
> - From glustomain.log, I can see:
>> 2019-03-31 12:56:00,627 INFO (validate_io_procs) Validating IO on
>> 192.168.250.10:/mnt/testvol_distributed-replicated_cifs
>> 2019-03-31 12:56:00,627 INFO (_log_results) ESC[34;1mRETCODE (
>> root@192.168.250.10): 1ESC[0m
>> 2019-03-31 12:56:00,628 INFO (_log_results) ESC[47;30;1mSTDOUT (
>> root@192.168.250.10)...
>> Starting File/Dir Ops: 12:55:27:PM:Mar_31_2019
>> Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6' :
>> Invalid argument
>> Unable to create dir
>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0' : Invalid argument
>> Unable to create dir
>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir0' : Invalid
>> argument
>> Unable to create dir
>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir1' : Invalid
>> argument
>> Unable to create dir
>> '/mnt/testvol_distributed-replicated_cifs/user6/dir1' : Invalid argument
>> Unable to create dir
>> '/mnt/testvol_distributed-replicated_cifs/user6/dir1/dir0' : Invalid
>> argument
>>
>> I'm right now assuming so

Re: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream?

2019-03-31 Thread Yaniv Kaul
On Wed, Mar 13, 2019 at 4:14 PM Jonathan Holloway 
wrote:

>
>
> On Wed, Mar 13, 2019 at 5:08 AM Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> On Wed, Mar 13, 2019 at 3:03 PM Yaniv Kaul  wrote:
>> > On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>> >>
>> >> What I am essentially looking to understand is whether there are
>> >> regular Glusto runs and whether the tests receive refreshes. However,
>> >> if there is no available Glusto service running upstream - that is a
>> >> whole new conversation.
>> >
>> >
>> > I'm* still trying to get it running properly on my simple
>> Vagrant+Ansible setup[1].
>> > Right now I'm installing Gluster + Glusto + creating bricks, pool and a
>> volume in ~3m on my latop.
>> >
>>
>> This is good. I think my original question was to the maintainer(s) of
>> Glusto along with the individuals involved in the automated testing
>> part of Gluster to understand the challenges in deploying this for the
>> project.
>>
>> > Once I do get it fully working, we'll get to make it work faster, clean
>> it up and and see how can we get code coverage.
>> >
>> > Unless there's an alternative to the whole framework that I'm not aware
>> of?
>>
>> I haven't read anything to this effect on any list.
>>
>>
> This is cool. I haven't had a chance to give it a run on my laptop, but it
> looked good.
> Are you running into issues with Glusto, glusterlibs, and/or Glusto-tests?
>

All of the above.
- The client consumes at times 100% CPU, not sure why.
- There are missing deps which I'm reverse engineering from Gluster CI
(which by itself has some strange deps - why do we need python-docx ?)
- I'm failing with the cvt test, with
test_shrinking_volume_when_io_in_progress with the error:
AssertionError: IO failed on some of the clients

I had hoped it could give me a bit more hint:
- which clients? (I happen to have one, so that's easy)
- What IO workload?
- What error?

- I hope there's a mode that does NOT perform cleanup/teardown, so it's
easier to look at the issue at hand.

- From glustomain.log, I can see:
2019-03-31 12:56:00,627 INFO (validate_io_procs) Validating IO on
192.168.250.10:/mnt/testvol_distributed-replicated_cifs
2019-03-31 12:56:00,627 INFO (_log_results) ESC[34;1mRETCODE (
root@192.168.250.10): 1ESC[0m
2019-03-31 12:56:00,628 INFO (_log_results) ESC[47;30;1mSTDOUT (
root@192.168.250.10)...
Starting File/Dir Ops: 12:55:27:PM:Mar_31_2019
Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6' :
Invalid argument
Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6/dir0'
: Invalid argument
Unable to create dir
'/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir0' : Invalid
argument
Unable to create dir
'/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir1' : Invalid
argument
Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6/dir1'
: Invalid argument
Unable to create dir
'/mnt/testvol_distributed-replicated_cifs/user6/dir1/dir0' : Invalid
argument

I'm right now assuming something's wrong on my setup. Unclear what, yet.


> I was using the glusto-tests container to run tests locally and for BVT in
> the lab.
> I was running against lab VMs, so looking forward to giving the vagrant
> piece a go.
>
> By upstream service are we talking about the Jenkins in the CentOS
> environment, etc?
>

Yes.
Y.

@Vijay Bhaskar Reddy Avuthu  @Akarsha Rai
>  any insight?
>
> Cheers,
> Jonathan
>
> > Surely for most of the positive paths, we can (and perhaps should) use
>> the the Gluster Ansible modules.
>> > Y.
>> >
>> > [1] https://github.com/mykaul/vg
>> > * with an intern's help.
>> ___
>> automated-testing mailing list
>> automated-test...@gluster.org
>> https://lists.gluster.org/mailman/listinfo/automated-testing
>>
> ___
> automated-testing mailing list
> automated-test...@gluster.org
> https://lists.gluster.org/mailman/listinfo/automated-testing
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream?

2019-03-13 Thread Yaniv Kaul
On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> What I am essentially looking to understand is whether there are
> regular Glusto runs and whether the tests receive refreshes. However,
> if there is no available Glusto service running upstream - that is a
> whole new conversation.
>

I'm* still trying to get it running properly on my simple Vagrant+Ansible
setup[1].
Right now I'm installing Gluster + Glusto + creating bricks, a pool and a
volume in ~3m on my laptop.

Once I do get it fully working, we'll get to make it work faster, clean it
up, and see how we can get code coverage.

Unless there's an alternative to the whole framework that I'm not aware of?
Surely for most of the positive paths, we can (and perhaps should) use
the Gluster Ansible modules.
Y.

[1] https://github.com/mykaul/vg
* with an intern's help.
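
For anyone wanting to try the same setup, the flow with the linked repo is
presumably the usual Vagrant one (check the repo's README for the real
steps; this is only a sketch):

    git clone https://github.com/mykaul/vg
    cd vg
    vagrant up        # bring up the VMs and run the provisioning
    vagrant status    # confirm the nodes are running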

___
> automated-testing mailing list
> automated-test...@gluster.org
> https://lists.gluster.org/mailman/listinfo/automated-testing
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Reboot policy for the infra

2018-08-23 Thread Yaniv Kaul
On Thu, Aug 23, 2018 at 10:49 AM, Michael Scherer 
wrote:

> On Thursday, August 23, 2018 at 11:21 +0530, Nigel Babu wrote:
> > One more piece that's missing is when we'll restart the physical
> > servers.
> > That seems to be entirely missing. The rest looks good to me and I'm
> > happy
> > to add an item to next sprint to automate the node rebooting.
>
> That's covered as "as critical as the services that depend on them".
>
> Now, the problem I do have is that some servers (myrmicinae, to name one)
> take 30 minutes to reboot, and I can't diagnose or fix that without
> taking hours. This is the one running gerrit/jenkins, so it's not
> possible to spend time on this kind of test.
>

You'd imagine people would move to kexec reboots for VMs by now.
Not sure why it's not catching on.
(BTW, is it the shutdown or the bring-up that takes the time?)
Y.
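
For reference, a kexec-based reboot skips the firmware and bootloader stages
entirely; on a systemd host it is roughly the following (a sketch, assuming
the currently running kernel and command line are what you want to boot back
into):

    # stage the running kernel and reuse its command line
    kexec -l /boot/vmlinuz-$(uname -r) \
          --initrd=/boot/initramfs-$(uname -r).img \
          --reuse-cmdline

    # shut services down cleanly and jump straight into the staged kernel
    systemctl kexec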


>
>
>
> > On Tue, Aug 21, 2018 at 9:56 PM Michael Scherer 
> > wrote:
> >
> > > Hi,
> > >
> > > so that's kernel reboot time again, this time courtesy of Intel
> > > (again). I do not consider the issue to be "OMG the sky is
> > > falling",
> > > but enough to take time to streamline our process to reboot.
> > >
> > >
> > >
> > > Currently, we do not have a policy or anything, and I think the
> > > negotiation time around that is cumbersome:
> > > - we need to reach people, which takes time and adds latency (which
> > > would be bad if it were an urgent issue, and likely adds undue stress
> > > while waiting)
> > >
> > > - we need to keep track of what was supposed to be done, which is also
> > > cumbersome
> > >
> > > While that wouldn't be a problem if I had only gluster to deal with, my
> > > team of 3 has to deal with a few more projects than one, and
> > > orchestrating choices for a dozen groups is time-consuming (just think
> > > of the last time you had to go to a restaurant after a conference to see
> > > how hard it is to reach agreement).
> > >
> > > So I would propose that we simplify that with the following policy:
> > >
> > > - Jenkins builders would be rebooted by jenkins on a regular basis.
> > > I do not know how we can do that, but given that we have enough nodes to
> > > sustain builds, it shouldn't impact developers in a big way. The only
> > > exception is the freebsd builder, since we only have 1 functional one at
> > > the moment. But once the 2nd is working, it should be treated like the
> > > others.
> > >
> > > - services in HA (firewall, reverse proxy, internal squid/DNS) would be
> > > rebooted during the day without notice. Due to working HA, that's not
> > > user-impacting. In fact, that's already what I do.
> > >
> > > - services not in HA should be pushed towards HA (gerrit might get there
> > > one day, no way for jenkins :/; need to see about postgres and thus
> > > fstat/softserve, and maybe try to get something for
> > > download.gluster.org)
> > >
> > > - critical services not in HA should have their reboots announced in
> > > advance. Critical means the services listed here:
> > > https://gluster-infra-docs.readthedocs.io/emergency.html
> > >
> > > - services not visible to end users (backup servers, ansible deployment,
> > > etc.) can be rebooted at will
> > >
> > > Then the only question is what to do about stuff not in the previous
> > > categories, like softserve and fstat.
> > >
> > > Also, all dependencies are as critical as the most critical service
> > > that depends on them. So the hypervisors hosting gerrit/jenkins are
> > > critical (until we find a way to avoid outages); the ones for builders
> > > are not.
> > >
> > >
> > >
> > > Thoughts, ideas?
> > >
> > >
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure and Platform, OSAS
> > >
> > > ___
> > > Gluster-infra mailing list
> > > Gluster-infra@gluster.org
> > > https://lists.gluster.org/mailman/listinfo/gluster-infra
> >
> >
> >
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] Fwd: Gerrit downtime on Aug 8, 2016

2018-08-08 Thread Yaniv Kaul
On Wed, Aug 8, 2018 at 1:28 PM, Deepshikha Khandelwal 
wrote:

> Gerrit is now upgraded to the newer version and is back online.
>

Nice, thanks!
I'm trying out the new UI. It needs some getting used to, I guess.
Have we upgraded to NoteDb?
Y.


>
> Please file a bug if you face any issue.
> On Tue, Aug 7, 2018 at 11:53 AM Nigel Babu  wrote:
> >
> > Reminder, this upgrade is tomorrow.
> >
> > -- Forwarded message -
> > From: Nigel Babu 
> > Date: Fri, Jul 27, 2018 at 5:28 PM
> > Subject: Gerrit downtime on Aug 8, 2016
> > To: gluster-devel 
> > Cc: gluster-infra , <
> automated-test...@gluster.org>
> >
> >
> > Hello,
> >
> > It's been a while since we upgraded Gerrit. We plan to do a full upgrade
> and move to 2.15.3. Among other changes, this brings in the new PolyGerrit
> interface which brings significant frontend changes. You can take a look at
> how this would look on the staging site[1].
> >
> > ## Outage Window
> > 0330 EDT to 0730 EDT
> > 0730 UTC to 1130 UTC
> > 1300 IST to 1700 IST
> >
> > The actual time needed for the upgrade is about an hour, but we want
> > to keep a larger window open to roll back in the event of any problems
> > during the upgrade.
> >
> > --
> > nigelb
> >
> >
> > --
> > nigelb
> > ___
> > Gluster-infra mailing list
> > Gluster-infra@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-infra
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra

Re: [Gluster-infra] [Gluster-devel] bug-1432542-mpx-restart-crash.t failing

2018-07-09 Thread Yaniv Kaul
On Mon, Jul 9, 2018, 5:41 PM Nithya Balachandran 
wrote:

> We discussed reducing the number of volumes in the maintainers'
> meeting. Should we still go ahead and do that?

Do we know how much it will save us? There is value in some moderate number
of volumes (especially if we can ensure they are not all identical).


>
> On 9 July 2018 at 15:45, Xavi Hernandez  wrote:
>
>> On Mon, Jul 9, 2018 at 11:14 AM Karthik Subrahmanya 
>> wrote:
>>
>>> Hi Deepshikha,
>>>
>>> Are you looking into this failure? I can still see this happening for
>>> all the regression runs.
>>>
>>
>> I've executed the failing script on my laptop and all tests finish
>> relatively fast. What seems to take time is the final cleanup. I can see
>> 'semanage' taking some CPU during destruction of volumes. The test required
>> 350 seconds to finish successfully.
>>
>> Not sure what caused the cleanup time to increase, but I've created a bug
>> [1] to track this and a patch [2] to give more time to this test. This
>> should allow all blocked regressions to complete successfully.
>>
>> Xavi
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1599250
>> [2] https://review.gluster.org/20482
>>
>>
>>> Thanks & Regards,
>>> Karthik
>>>
>>> On Sun, Jul 8, 2018 at 7:18 AM Atin Mukherjee 
>>> wrote:
>>>

 https://build.gluster.org/job/regression-test-with-multiplex/794/display/redirect
 has the same test failing. Is the reason of the failure different given
 this is on jenkins?

 On Sat, 7 Jul 2018 at 19:12, Deepshikha Khandelwal 
 wrote:

> Hi folks,
>
> The issue[1] has been resolved. Now the softserve instances will
> have 2GB RAM, i.e. the same as the Jenkins builders' sizing
> configuration.
>
> [1] https://github.com/gluster/softserve/issues/40
>
> Thanks,
> Deepshikha Khandelwal
>
> On Fri, Jul 6, 2018 at 6:14 PM, Karthik Subrahmanya <
> ksubr...@redhat.com> wrote:
> >
> >
> > On Fri 6 Jul, 2018, 5:18 PM Deepshikha Khandelwal, <
> dkhan...@redhat.com>
> > wrote:
> >>
> >> Hi Poornima/Karthik,
> >>
> >> We've looked into the memory error that this softserve instance
> >> showed. These machine instances have 1GB RAM, which is not the
> >> case with the Jenkins builders. It's 2GB RAM there.
> >>
> >> We've created the issue [1] and will solve it soon.
> >
> > Great. Thanks for the update.
> >>
> >>
> >> Sorry for the inconvenience.
> >>
> >> [1] https://github.com/gluster/softserve/issues/40
> >>
> >> Thanks,
> >> Deepshikha Khandelwal
> >>
> >> On Fri, Jul 6, 2018 at 3:44 PM, Karthik Subrahmanya <
> ksubr...@redhat.com>
> >> wrote:
> >> > Thanks Poornima for the analysis.
> >> > Can someone work on fixing this please?
> >> >
> >> > ~Karthik
> >> >
> >> > On Fri, Jul 6, 2018 at 3:17 PM Poornima Gurusiddaiah
> >> > 
> >> > wrote:
> >> >>
> >> >> The same test case is failing for my patch as well [1]. I requested a
> >> >> regression system and tried to reproduce it.
> >> >> From my analysis, the brick process (multiplexed) is consuming a lot
> >> >> of memory, and is being OOM-killed. The regression machine has 1GB RAM
> >> >> and the process is consuming more than 1GB. 1GB for 120 bricks is
> >> >> acceptable considering there are 1000 threads in that brick process.
> >> >> Ways to fix:
> >> >> - Increase the regression system RAM size, OR
> >> >> - Decrease the number of volumes in the test case.
> >> >>
> >> >> But what is strange is why the test passes sometimes for some patches.
> >> >> There could be some bug/? in memory consumption.
> >> >>
> >> >> Regards,
> >> >> Poornima
> >> >>
> >> >>
> >> >> On Fri, Jul 6, 2018 at 2:11 PM, Karthik Subrahmanya
> >> >> 
> >> >> wrote:
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> $subject is failing on centos regression for most of the
> patches with
> >> >>> timeout error.
> >> >>>
> >> >>> 07:32:34
> >> >>>
> >> >>>
> 
> >> >>> 07:32:34 [07:33:05] Running tests in file
> >> >>> ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
> >> >>> 07:32:34 Timeout set is 300, default 200
> >> >>> 07:37:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
> timed out
> >> >>> after 300 seconds
> >> >>> 07:37:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t: bad
> status
> >> >>> 124
> >> >>> 07:37:34
> >> >>> 07:37:34*
> >> >>> 07:37:34*   REGRESSION FAILED   *
> >> >>> 07:37:34* Retrying failed tests in case *
> >> >>> 07:37:34* we got some spurious failures *
> >> >>> 07:37:34  

[Gluster-infra] 'Clone with commit-msg hook' produces wrong scp command

2018-06-18 Thread Yaniv Kaul
When I choose 'clone with a commit-msg hook' in Gerrit, I get the following
scp command:
git clone ssh://myk...@review.gluster.org/glusterfs-specs && scp -p *-P
29418* myk...@review.gluster.org:hooks/commit-msg
glusterfs-specs/.git/hooks/

See the part in bold in the scp command - it points at port 29418.
This is incorrect for review.gluster.org.
The standard port does work.
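
In other words, the working variant is the same command minus the port
override, sketched here with a placeholder username (assuming, as above,
that the standard SSH port serves both git and the hook on
review.gluster.org):

    git clone ssh://<user>@review.gluster.org/glusterfs-specs && \
      scp -p <user>@review.gluster.org:hooks/commit-msg glusterfs-specs/.git/hooks/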

TIA,
Y.
___
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra