[Gluster-infra] [Bug 1696518] New: builder203 does not have a valid hostname set

2019-04-04 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1696518

Bug ID: 1696518
   Summary: builder203 does not have a valid hostname set
   Product: GlusterFS
   Version: mainline
Status: NEW
 Component: project-infrastructure
  Assignee: b...@gluster.org
  Reporter: dkhan...@redhat.com
CC: b...@gluster.org, gluster-infra@gluster.org
  Target Milestone: ---
Classification: Community



Description of problem:


After reinstallation, builder203 on AWS does not have a valid hostname set,
and hence its network service might behave oddly.
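
A quick way to verify and fix this by hand (a sketch, assuming the builder
runs systemd; the FQDN below is a placeholder for whatever our inventory
says):

  # show the static and transient hostname
  hostnamectl status
  # set the static hostname (placeholder name -- use the real FQDN)
  hostnamectl set-hostname builder203.example.org
  # restart networking so services pick up the new name (CentOS 7)
  systemctl restart network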



Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Michael Scherer
On Thursday, April 4, 2019 at 19:10 +0300, Yaniv Kaul wrote:
> I'm not convinced this is solved. I just had what I believe is a
> similar failure:
> 
> 00:12:02.532 A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
> 00:12:02.532 mount.nfs: rpc.statd is not running but is required for remote locking.
> 00:12:02.532 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> 00:12:02.532 mount.nfs: an incorrect mount option was specified
> 
> (of course, it can always be my patch!)
> 
> https://build.gluster.org/job/centos7-regression/5384/console

same issue, different builder (206). I will check them all, as the
issue is more widespread than I expected (or it popped up since the last
time I checked).
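
A rough sketch of how such a sweep could look (builder hostnames here are
placeholders for the real inventory; the check targets the
rpcbind-on-ipv6 symptom described further down the thread):

  # flag any builder where rpcbind is still bound on the ipv6 wildcard
  for n in $(seq 200 209); do
    if ssh "builder${n}.example.org" 'ss -tln | grep -q "\[::\]:111"'; then
      echo "builder${n}: rpcbind bound on ipv6"
    fi
  done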


> 
> On Thu, Apr 4, 2019 at 6:56 PM Atin Mukherjee wrote:
> 
> > Thanks misc. I have always seen a pattern where, on a reattempt
> > (recheck centos), the same builder is picked up many times, even
> > though it is supposed to pick up builders in a round-robin manner.
> > 
> > On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer wrote:
> > 
> > > On Thursday, April 4, 2019 at 15:19 +0200, Michael Scherer wrote:
> > > > On Thursday, April 4, 2019 at 13:53 +0200, Michael Scherer wrote:
> > > > > On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
> > > > > > Based on what I have seen, any multi-node test case will
> > > > > > fail, and the above one is picked first from that group. If
> > > > > > I am correct, none of the code fixes will go through
> > > > > > regression until this is fixed. I suspect it to be an infra
> > > > > > issue again. If we look at
> > > > > > https://review.gluster.org/#/c/glusterfs/+/22501/ and
> > > > > > https://build.gluster.org/job/centos7-regression/5382/, peer
> > > > > > handshaking is stuck because 127.1.1.1 never receives a
> > > > > > response back. Did we end up with the firewall and other
> > > > > > network settings screwed up? The test never fails locally.
> > > > > 
> > > > > The firewall didn't change, and it has had the line
> > > > > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on
> > > > > the loopback interface works. (I am not even sure that
> > > > > netfilter does anything meaningful on the loopback interface,
> > > > > but maybe I am wrong, and I am not keen on reading kernel code
> > > > > to find out.)
> > > > > 
> > > > > Ping seems to work fine as well, so we can exclude a routing
> > > > > issue.
> > > > > 
> > > > > Maybe we should look at the socket: does it listen on a
> > > > > specific address or not?
> > > > 
> > > > So, I looked at the first 20 failures, removed all that were not
> > > > related to rebal-all-nodes-migrate.t, and saw that all of them
> > > > ran on builder203, which was freshly reinstalled. As Deepshika
> > > > noticed today, this one had an issue with ipv6, the 2nd issue we
> > > > were tracking.
> > > > 
> > > > Summary: the rpcbind.socket systemd unit listens on ipv6 despite
> > > > ipv6 being disabled, and the fix is to reload systemd. We have
> > > > so far no idea why it happens, but we suspect it is related to
> > > > the network issue we did identify, as it happens only after a
> > > > reboot, which in turn happens only if a build is
> > > > cancelled/crashed/aborted.
> > > > 
> > > > I applied the workaround on builder203, so if the culprit is
> > > > that specific issue, I guess that's fixed.
> > > > 
> > > > I started a test to see how it goes:
> > > > https://build.gluster.org/job/centos7-regression/5383/
> > > 
> > > The test just passed, so I would assume the problem was local to
> > > builder203. Not sure why it was always selected, except that it
> > > was the only one failing, so it was always free to pick up new
> > > jobs.
> > > 
> > > Maybe we should increase the number of builders so this doesn't
> > > happen, as I guess the other builders were busy at that time?
> > > 
> > > --
> > > Michael Scherer
> > > Sysadmin, Community Infrastructure and Platform, OSAS
> > > 
> > > 
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Yaniv Kaul
I'm not convinced this is solved. I just had what I believe is a similar
failure:

00:12:02.532 A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
00:12:02.532 mount.nfs: rpc.statd is not running but is required for remote locking.
00:12:02.532 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
00:12:02.532 mount.nfs: an incorrect mount option was specified

(of course, it can always be my patch!)

https://build.gluster.org/job/centos7-regression/5384/console
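
For anyone poking at the builder, the usual way to see why the dependency
job failed would be something like this (a sketch; unit names taken from
the log above):

  # why did rpc-statd and the unit it depends on fail?
  systemctl status rpc-statd.service rpcbind.socket
  # last 50 log lines from both units
  journalctl -u rpc-statd -u rpcbind -n 50 --no-pager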


On Thu, Apr 4, 2019 at 6:56 PM Atin Mukherjee wrote:

> Thanks misc. I have always seen a pattern where, on a reattempt
> (recheck centos), the same builder is picked up many times, even though
> it is supposed to pick up builders in a round-robin manner.
>
> On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer wrote:
>
>> On Thursday, April 4, 2019 at 15:19 +0200, Michael Scherer wrote:
>> > On Thursday, April 4, 2019 at 13:53 +0200, Michael Scherer wrote:
>> > > On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
>> > > > Based on what I have seen, any multi-node test case will fail,
>> > > > and the above one is picked first from that group. If I am
>> > > > correct, none of the code fixes will go through regression until
>> > > > this is fixed. I suspect it to be an infra issue again. If we
>> > > > look at https://review.gluster.org/#/c/glusterfs/+/22501/ and
>> > > > https://build.gluster.org/job/centos7-regression/5382/, peer
>> > > > handshaking is stuck because 127.1.1.1 never receives a response
>> > > > back. Did we end up with the firewall and other network settings
>> > > > screwed up? The test never fails locally.
>> > >
>> > > The firewall didn't change, and it has had the line
>> > > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
>> > > loopback interface works. (I am not even sure that netfilter does
>> > > anything meaningful on the loopback interface, but maybe I am
>> > > wrong, and I am not keen on reading kernel code to find out.)
>> > >
>> > > Ping seems to work fine as well, so we can exclude a routing issue.
>> > >
>> > > Maybe we should look at the socket: does it listen on a specific
>> > > address or not?
>> >
>> > So, I looked at the first 20 failures, removed all that were not
>> > related to rebal-all-nodes-migrate.t, and saw that all of them ran
>> > on builder203, which was freshly reinstalled. As Deepshika noticed
>> > today, this one had an issue with ipv6, the 2nd issue we were
>> > tracking.
>> >
>> > Summary: the rpcbind.socket systemd unit listens on ipv6 despite
>> > ipv6 being disabled, and the fix is to reload systemd. We have so
>> > far no idea why it happens, but we suspect it is related to the
>> > network issue we did identify, as it happens only after a reboot,
>> > which in turn happens only if a build is cancelled/crashed/aborted.
>> >
>> > I applied the workaround on builder203, so if the culprit is that
>> > specific issue, I guess that's fixed.
>> >
>> > I started a test to see how it goes:
>> > https://build.gluster.org/job/centos7-regression/5383/
>>
>> The test just passed, so I would assume the problem was local to
>> builder203. Not sure why it was always selected, except that it was
>> the only one failing, so it was always free to pick up new jobs.
>>
>> Maybe we should increase the number of builders so this doesn't
>> happen, as I guess the other builders were busy at that time?
>>
>> --
>> Michael Scherer
>> Sysadmin, Community Infrastructure and Platform, OSAS
>>
>>

Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Atin Mukherjee
Thanks misc. I have always seen a pattern where, on a reattempt (recheck
centos), the same builder is picked up many times, even though it is
supposed to pick up builders in a round-robin manner.

On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer wrote:

> On Thursday, April 4, 2019 at 15:19 +0200, Michael Scherer wrote:
> > On Thursday, April 4, 2019 at 13:53 +0200, Michael Scherer wrote:
> > > On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
> > > > Based on what I have seen, any multi-node test case will fail,
> > > > and the above one is picked first from that group. If I am
> > > > correct, none of the code fixes will go through regression until
> > > > this is fixed. I suspect it to be an infra issue again. If we
> > > > look at https://review.gluster.org/#/c/glusterfs/+/22501/ and
> > > > https://build.gluster.org/job/centos7-regression/5382/, peer
> > > > handshaking is stuck because 127.1.1.1 never receives a response
> > > > back. Did we end up with the firewall and other network settings
> > > > screwed up? The test never fails locally.
> > >
> > > The firewall didn't change, and it has had the line
> > > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
> > > loopback interface works. (I am not even sure that netfilter does
> > > anything meaningful on the loopback interface, but maybe I am
> > > wrong, and I am not keen on reading kernel code to find out.)
> > >
> > > Ping seems to work fine as well, so we can exclude a routing issue.
> > >
> > > Maybe we should look at the socket: does it listen on a specific
> > > address or not?
> >
> > So, I looked at the first 20 failures, removed all that were not
> > related to rebal-all-nodes-migrate.t, and saw that all of them ran
> > on builder203, which was freshly reinstalled. As Deepshika noticed
> > today, this one had an issue with ipv6, the 2nd issue we were
> > tracking.
> >
> > Summary: the rpcbind.socket systemd unit listens on ipv6 despite
> > ipv6 being disabled, and the fix is to reload systemd. We have so
> > far no idea why it happens, but we suspect it is related to the
> > network issue we did identify, as it happens only after a reboot,
> > which in turn happens only if a build is cancelled/crashed/aborted.
> >
> > I applied the workaround on builder203, so if the culprit is that
> > specific issue, I guess that's fixed.
> >
> > I started a test to see how it goes:
> > https://build.gluster.org/job/centos7-regression/5383/
>
> The test just passed, so I would assume the problem was local to
> builder203. Not sure why it was always selected, except that it was
> the only one failing, so it was always free to pick up new jobs.
>
> Maybe we should increase the number of builders so this doesn't
> happen, as I guess the other builders were busy at that time?
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
>

Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Michael Scherer
On Thursday, April 4, 2019 at 15:19 +0200, Michael Scherer wrote:
> On Thursday, April 4, 2019 at 13:53 +0200, Michael Scherer wrote:
> > On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
> > > Based on what I have seen, any multi-node test case will fail,
> > > and the above one is picked first from that group. If I am
> > > correct, none of the code fixes will go through regression until
> > > this is fixed. I suspect it to be an infra issue again. If we
> > > look at https://review.gluster.org/#/c/glusterfs/+/22501/ and
> > > https://build.gluster.org/job/centos7-regression/5382/, peer
> > > handshaking is stuck because 127.1.1.1 never receives a response
> > > back. Did we end up with the firewall and other network settings
> > > screwed up? The test never fails locally.
> > 
> > The firewall didn't change, and it has had the line
> > "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
> > loopback interface works. (I am not even sure that netfilter does
> > anything meaningful on the loopback interface, but maybe I am
> > wrong, and I am not keen on reading kernel code to find out.)
> > 
> > Ping seems to work fine as well, so we can exclude a routing issue.
> > 
> > Maybe we should look at the socket: does it listen on a specific
> > address or not?
> 
> So, I looked at the first 20 failures, removed all that were not
> related to rebal-all-nodes-migrate.t, and saw that all of them ran on
> builder203, which was freshly reinstalled. As Deepshika noticed today,
> this one had an issue with ipv6, the 2nd issue we were tracking.
> 
> Summary: the rpcbind.socket systemd unit listens on ipv6 despite ipv6
> being disabled, and the fix is to reload systemd. We have so far no
> idea why it happens, but we suspect it is related to the network issue
> we did identify, as it happens only after a reboot, which in turn
> happens only if a build is cancelled/crashed/aborted.
> 
> I applied the workaround on builder203, so if the culprit is that
> specific issue, I guess that's fixed.
> 
> I started a test to see how it goes:
> https://build.gluster.org/job/centos7-regression/5383/

The test just passed, so I would assume the problem was local to
builder203. Not sure why it was always selected, except that it was the
only one failing, so it was always free to pick up new jobs.

Maybe we should increase the number of builders so this doesn't happen,
as I guess the other builders were busy at that time?

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Michael Scherer
On Thursday, April 4, 2019 at 13:53 +0200, Michael Scherer wrote:
> On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
> > Based on what I have seen, any multi-node test case will fail, and
> > the above one is picked first from that group. If I am correct,
> > none of the code fixes will go through regression until this is
> > fixed. I suspect it to be an infra issue again. If we look at
> > https://review.gluster.org/#/c/glusterfs/+/22501/ and
> > https://build.gluster.org/job/centos7-regression/5382/, peer
> > handshaking is stuck because 127.1.1.1 never receives a response
> > back. Did we end up with the firewall and other network settings
> > screwed up? The test never fails locally.
> 
> The firewall didn't change, and it has had the line
> "-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
> loopback interface works. (I am not even sure that netfilter does
> anything meaningful on the loopback interface, but maybe I am wrong,
> and I am not keen on reading kernel code to find out.)
> 
> Ping seems to work fine as well, so we can exclude a routing issue.
> 
> Maybe we should look at the socket: does it listen on a specific
> address or not?

So, I looked at the first 20 failures, removed all that were not related
to rebal-all-nodes-migrate.t, and saw that all of them ran on builder203,
which was freshly reinstalled. As Deepshika noticed today, this one had
an issue with ipv6, the 2nd issue we were tracking.

Summary: the rpcbind.socket systemd unit listens on ipv6 despite ipv6
being disabled, and the fix is to reload systemd. We have so far no idea
why it happens, but we suspect it is related to the network issue we did
identify, as it happens only after a reboot, which in turn happens only
if a build is cancelled/crashed/aborted.

I applied the workaround on builder203, so if the culprit is that
specific issue, I guess that's fixed.
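
For the record, roughly what the check and the workaround amount to
(a sketch, assuming a systemd host):

  # ipv6 is supposed to be disabled...
  sysctl net.ipv6.conf.all.disable_ipv6
  # ...yet rpcbind may still be bound on the ipv6 wildcard (port 111)
  ss -tln | grep ':111'
  # workaround: make systemd re-read its state, then recreate the socket
  systemctl daemon-reload
  systemctl restart rpcbind.socket rpcbind.service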

I started a test to see how it goes:
https://build.gluster.org/job/centos7-regression/5383/

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-infra] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Michael Scherer
On Thursday, April 4, 2019 at 16:13 +0530, Atin Mukherjee wrote:
> Based on what I have seen, any multi-node test case will fail, and the
> above one is picked first from that group. If I am correct, none of
> the code fixes will go through regression until this is fixed. I
> suspect it to be an infra issue again. If we look at
> https://review.gluster.org/#/c/glusterfs/+/22501/ and
> https://build.gluster.org/job/centos7-regression/5382/, peer
> handshaking is stuck because 127.1.1.1 never receives a response back.
> Did we end up with the firewall and other network settings screwed up?
> The test never fails locally.

The firewall didn't change, and it has had the line
"-A INPUT -i lo -j ACCEPT" since the start, so all traffic on the
loopback interface works. (I am not even sure that netfilter does
anything meaningful on the loopback interface, but maybe I am wrong, and
I am not keen on reading kernel code to find out.)
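
Quick way to double-check the rule is in place and actually matching
(a sketch):

  # rule-spec form, should show "-A INPUT -i lo -j ACCEPT"
  iptables -S INPUT | grep -- '-i lo'
  # verbose listing with packet counters, to see the rule being hit
  iptables -L INPUT -v -n | grep 'lo'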


Ping seems to work fine as well, so we can exclude a routing issue.

Maybe we should look at the socket: does it listen on a specific
address or not?
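
Something along these lines would answer that (a sketch; 24007 is
glusterd's usual management port):

  # list listening TCP sockets with the owning process
  ss -tlnp | grep glusterd
  # or check the management port explicitly
  ss -tln sport = :24007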



> 15:51:21 Number of Peers: 2
> 15:51:21
> 15:51:21 Hostname: 127.1.1.2
> 15:51:21 Uuid: 0e689ca8-d522-4b2f-b437-9dcde3579401
> 15:51:21 State: Accepted peer request (Connected)
> 15:51:21
> 15:51:21 Hostname: 127.1.1.3
> 15:51:21 Uuid: a83a3bfa-729f-4a1c-8f9a-ae7d04ee4544
> 15:51:21 State: Accepted peer request (Connected)
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





[Gluster-infra] rebal-all-nodes-migrate.t always fails now

2019-04-04 Thread Atin Mukherjee
Based on what I have seen, any multi-node test case will fail, and the
above one is picked first from that group. If I am correct, none of the
code fixes will go through regression until this is fixed. I suspect it
to be an infra issue again. If we look at
https://review.gluster.org/#/c/glusterfs/+/22501/ and
https://build.gluster.org/job/centos7-regression/5382/, peer handshaking
is stuck because 127.1.1.1 never receives a response back. Did we end up
with the firewall and other network settings screwed up? The test never
fails locally.
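
(The timestamps below come from the regression job's console log; the
equivalent check on the nodes themselves is simply:)

  # run on the first node (127.1.1.1) of the test cluster
  gluster peer status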

15:51:21 Number of Peers: 2
15:51:21
15:51:21 Hostname: 127.1.1.2
15:51:21 Uuid: 0e689ca8-d522-4b2f-b437-9dcde3579401
15:51:21 State: Accepted peer request (Connected)
15:51:21
15:51:21 Hostname: 127.1.1.3
15:51:21 Uuid: a83a3bfa-729f-4a1c-8f9a-ae7d04ee4544
15:51:21 State: Accepted peer request (Connected)