Re: [Gluster-users] issues recovering machine in gluster

2016-06-15 Thread Arif Ali
On 15 June 2016 at 08:09, Atin Mukherjee wrote:

>
>
> On 06/15/2016 12:14 PM, Arif Ali wrote:
> >
> > On 15 June 2016 at 06:48, Atin Mukherjee wrote:
> >
> >
> >
> > On 06/15/2016 11:06 AM, Gandalf Corvotempesta wrote:
> > > On 15 Jun 2016 at 07:09, "Atin Mukherjee" wrote:
> > >> To get rid of this situation, you'd need to stop all the running
> > >> glusterd instances, go into the /var/lib/glusterd/peers folder on all
> > >> the nodes, and manually correct the UUID file names and their content
> > >> if required.
> > >
> > > If I understood properly, the only way to fix this is by bringing the
> > > whole cluster down? "you'd need to stop all the running glusterd
> > > instances"
> > >
> > > I hope you are referring to all instances on the failed node...
> >
> > No, since the configuration is synced across all the nodes, any
> > incorrect data gets replicated throughout. So in this case, to be on the
> > safer side and to validate correctness, the glusterd instances on *all*
> > the nodes should be brought down. Having said that, this doesn't impact
> > I/O, as the management path is separate from the I/O path.
> >
> >
> > As a sanity check, one of the things I did last night, when I had
> > downtime arranged, was to reboot the whole gluster system. I thought this
> > is something that would be asked, as I had seen similar requests on the
> > mailing list previously.
> >
> > Unfortunately though, it didn't fix the problem.
>
> A reboot alone is not going to solve the problem. You'd need to correct
> the configuration as I explained earlier in this thread. If that doesn't
> work, please send me the contents of /var/lib/glusterd/peers/ and the
> /var/lib/glusterd/glusterd.info file from all the nodes where glusterd
> instances are running. I'll take a look, correct them, and send them back
> to you.
>

Thanks Atin,

Apologies, I missed your mail, as I was travelling

I have checked the relevant files you mentioned, and they look correct to
me, but I have attached them for sanity; maybe you can spot something that
I have not seen.


gluster_debug.tgz
Description: GNU Zip compressed data

Re: [Gluster-users] issues recovering machine in gluster

2016-06-15 Thread Atin Mukherjee


On 06/15/2016 12:14 PM, Arif Ali wrote:
> 
> On 15 June 2016 at 06:48, Atin Mukherjee wrote:
> 
> 
> 
> On 06/15/2016 11:06 AM, Gandalf Corvotempesta wrote:
> > On 15 Jun 2016 at 07:09, "Atin Mukherjee" wrote:
> >> To get rid of this situation, you'd need to stop all the running
> >> glusterd instances, go into the /var/lib/glusterd/peers folder on all
> >> the nodes, and manually correct the UUID file names and their content if
> >> required.
> >
> > If I understood properly, the only way to fix this is by bringing the
> > whole cluster down? "you'd need to stop all the running glusterd
> > instances"
> >
> > I hope you are referring to all instances on the failed node...
> 
> No, since the configuration is synced across all the nodes, any incorrect
> data gets replicated throughout. So in this case, to be on the safer side
> and to validate correctness, the glusterd instances on *all* the nodes
> should be brought down. Having said that, this doesn't impact I/O, as the
> management path is separate from the I/O path.
> 
> 
> As a sanity check, one of the things I did last night, when I had downtime
> arranged, was to reboot the whole gluster system. I thought this is
> something that would be asked, as I had seen similar requests on the
> mailing list previously.
> 
> Unfortunately though, it didn't fix the problem.

A reboot alone is not going to solve the problem. You'd need to correct the
configuration as I explained earlier in this thread. If that doesn't work,
please send me the contents of /var/lib/glusterd/peers/ and the
/var/lib/glusterd/glusterd.info file from all the nodes where glusterd
instances are running. I'll take a look, correct them, and send them back
to you.
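
For reference, a rough sketch of how those files could be gathered on each
node (the archive name is only an example; the paths assume a standard
install):

# collect the local glusterd identity and peer definitions from one node
tar czf /tmp/glusterd-config-$(hostname -s).tgz \
    /var/lib/glusterd/glusterd.info \
    /var/lib/glusterd/peers/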

> 
> Any other suggestions are welcome
> 


Re: [Gluster-users] issues recovering machine in gluster

2016-06-15 Thread Arif Ali
On 15 June 2016 at 06:48, Atin Mukherjee wrote:

>
>
> On 06/15/2016 11:06 AM, Gandalf Corvotempesta wrote:
> > On 15 Jun 2016 at 07:09, "Atin Mukherjee" wrote:
> >> To get rid of this situation, you'd need to stop all the running
> >> glusterd instances, go into the /var/lib/glusterd/peers folder on all
> >> the nodes, and manually correct the UUID file names and their content if
> >> required.
> >
> > If I understood properly, the only way to fix this is by bringing the
> > whole cluster down? "you'd need to stop all the running glusterd
> > instances"
> >
> > I hope you are referring to all instances on the failed node...
>
> No, since the configuration is synced across all the nodes, any incorrect
> data gets replicated throughout. So in this case, to be on the safer side
> and to validate correctness, the glusterd instances on *all* the nodes
> should be brought down. Having said that, this doesn't impact I/O, as the
> management path is separate from the I/O path.
>
>
As a sanity check, one of the things I did last night, when I had downtime
arranged, was to reboot the whole gluster system. I thought this is
something that would be asked, as I had seen similar requests on the
mailing list previously.

Unfortunately though, it didn't fix the problem.

Any other suggestions are welcome

Re: [Gluster-users] issues recovering machine in gluster

2016-06-14 Thread Atin Mukherjee


On 06/15/2016 11:06 AM, Gandalf Corvotempesta wrote:
> On 15 Jun 2016 at 07:09, "Atin Mukherjee" wrote:
>> To get rid of this situation, you'd need to stop all the running glusterd
>> instances, go into the /var/lib/glusterd/peers folder on all the nodes,
>> and manually correct the UUID file names and their content if required.
> 
> If I understood properly, the only way to fix this is by bringing the
> whole cluster down? "you'd need to stop all the running glusterd instances"
> 
> I hope you are referring to all instances on the failed node...

No, since the configuration is synced across all the nodes, any incorrect
data gets replicated throughout. So in this case, to be on the safer side
and to validate correctness, the glusterd instances on *all* the nodes
should be brought down. Having said that, this doesn't impact I/O, as the
management path is separate from the I/O path.

> 


Re: [Gluster-users] issues recovering machine in gluster

2016-06-14 Thread Gandalf Corvotempesta
On 15 Jun 2016 at 07:09, "Atin Mukherjee" wrote:
> To get rid of this situation, you'd need to stop all the running glusterd
> instances, go into the /var/lib/glusterd/peers folder on all the nodes,
> and manually correct the UUID file names and their content if required.

If I understood properly, the only way to fix this is by bringing the whole
cluster down? "you'd need to stop all the running glusterd instances"

I hope you are referring to all instances on the failed node...

Re: [Gluster-users] issues recovering machine in gluster

2016-06-14 Thread Atin Mukherjee
So the issue looks like an incorrect UUID got populated in the peer
configuration, which led to this inconsistency; here is the log entry that
proves it. I have a feeling that the steps were not performed properly, or
that you missed copying the older UUID of the failed node over to the new
one.

[2016-06-13 18:25:09.738363] E [MSGID: 106170]
[glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req] 0-management:
Request from peer 10.28.9.12:65299 has an entry in peerinfo, but uuid
does not match

To get rid of this situation, you'd need to stop all the running glusterd
instances, go into the /var/lib/glusterd/peers folder on all the nodes, and
manually correct the UUID file names and their content if required.

Just to give you an idea of how the peer configurations are structured and
stored, here is an example:

On a 3-node cluster (say N1, N2, N3):

N1's UUID - dc07f77f-09f3-46f4-8d92-f2d7f6e627af
(by 'cat /var/lib/glusterd/glusterd.info | grep UUID' on N1)

N2's UUID - 02d157bd-a738-4914-991e-60953409f1b1
N3's UUID - 932186a6-4b29-4216-8da1-2fe193c928c1

N1's peer configuration
=======================
root@ebbc696b4dc4:/home/glusterfs# cd /var/lib/glusterd/peers/
root@ebbc696b4dc4:/var/lib/glusterd/peers# ls -lrt
total 8
-rw------- 1 root root 71 Jun 15 05:01 02d157bd-a738-4914-991e-60953409f1b1   -> N2's UUID
-rw------- 1 root root 71 Jun 15 05:02 932186a6-4b29-4216-8da1-2fe193c928c1   -> N3's UUID


Content of other peers (N2, N3) on N1's disk
============================================
root@ebbc696b4dc4:/var/lib/glusterd/peers# cat 02d157bd-a738-4914-991e-60953409f1b1
uuid=02d157bd-a738-4914-991e-60953409f1b1
state=3
hostname1=172.17.0.3

root@ebbc696b4dc4:/var/lib/glusterd/peers# cat 932186a6-4b29-4216-8da1-2fe193c928c1
uuid=932186a6-4b29-4216-8da1-2fe193c928c1
state=3
hostname1=172.17.0.4

Similarly, you will find the details of N1 and N2 on N3, and of N1 and N3
on N2.

You'd need to validate this theory on all the nodes, correct the content,
and remove the unwanted UUIDs. After that, restarting all the glusterd
instances should solve the problem.
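
As a rough sketch, the check/fix cycle on each node would look something
like this (the systemctl commands assume a systemd-based distribution;
adjust to your init system):

# stop the management daemon first; brick processes and I/O keep running
systemctl stop glusterd

# this node's own identity
grep UUID /var/lib/glusterd/glusterd.info

# peers/ must hold one file per *other* node, named after that node's UUID,
# and the uuid= line inside each file must match its file name
for f in /var/lib/glusterd/peers/*; do
    echo "== $f"
    cat "$f"
done

# after correcting any stale file names or uuid= entries, restart everywhere
systemctl start glusterd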

HTH,
Atin


On 06/13/2016 08:16 PM, Atin Mukherjee wrote:
> Please send us the glusterd log file along with cmd_history.log from all
> 6 nodes. The logs you mentioned in the thread are not relevant for
> debugging the issue. Which gluster version are you using?
> 
> ~Atin
> 
> On 06/13/2016 06:49 PM, Arif Ali wrote:
>> Hi all,
>>
>> Hopefully, someone can help
>>
>> We have a 6 node gluster setup, and have successfully got the gluster
>> system up and running, and had no issues with the initial install.
>>
>> For other reasons, we had to re-provision the nodes, and therefore we
>> had to go through some recovery steps to get the node back into the
>> system. The documentation I used was [1].
>>
>> The key thing is that everything in the documentation worked without a
>> problem. The replication of gluster works, and we can easily monitor that
>> through the heal commands.
>>
>> Unfortunately, we are not able to run "gluster volume status", which
>> hangs for a moment, and in the end we get "Error : Request timed out ".
>> Most of the log files are clean, except for
>> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log. See below for some of
>> the contents
>>
>> [2016-06-13 12:57:01.054458] W [socket.c:870:__socket_keepalive]
>> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
>> argument
>> [2016-06-13 12:57:01.054492] E [socket.c:2966:socket_connect]
>> 0-management: Failed to set keep-alive: Invalid argument
>> [2016-06-13 12:57:01.059023] W [socket.c:870:__socket_keepalive]
>> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
>> argument
>> [2016-06-13 12:57:01.059042] E [socket.c:2966:socket_connect]
>> 0-management: Failed to set keep-alive: Invalid argument
>>
>> Any assistance on this would be much appreciated.
>>
>> [1] 
>> https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html#Replacing_a_Host_Machine_with_the_Same_Hostname
>>
>> --
>> Arif Ali
>>
>> IRC: arif-ali at freenode
>> LinkedIn: http://uk.linkedin.com/in/arifali
>>
>>


Re: [Gluster-users] issues recovering machine in gluster

2016-06-13 Thread Atin Mukherjee
I'll take a look at the logs tomorrow and get back.

-Atin

On Monday 13 June 2016, Arif Ali wrote:

> Hi Atin,
>
> I have sent the tar file of logs in a PM
>
> The version of gluster that we have been using is:
>
> # rpm -qa | grep gluster
> glusterfs-api-3.7.11-1.el7.x86_64
> glusterfs-geo-replication-3.7.11-1.el7.x86_64
> glusterfs-libs-3.7.11-1.el7.x86_64
> glusterfs-client-xlators-3.7.11-1.el7.x86_64
> glusterfs-fuse-3.7.11-1.el7.x86_64
> glusterfs-server-3.7.11-1.el7.x86_64
> glusterfs-3.7.11-1.el7.x86_64
> glusterfs-cli-3.7.11-1.el7.x86_64
>
> --
> Arif Ali
>
> IRC: arif-ali at freenode
> LinkedIn: http://uk.linkedin.com/in/arifali
>
> On 13 June 2016 at 15:46, Atin Mukherjee wrote:
>
>> Please send us the glusterd log file along with cmd_history.log from all
>> 6 nodes. The logs you mentioned in the thread are not relevant for
>> debugging the issue. Which gluster version are you using?
>>
>> ~Atin
>>
>> On 06/13/2016 06:49 PM, Arif Ali wrote:
>> > Hi all,
>> >
>> > Hopefully, someone can help
>> >
>> > We have a 6 node gluster setup, and have successfully got the gluster
>> > system up and running, and had no issues with the initial install.
>> >
>> > For other reasons, we had to re-provision the nodes, and therefore we
>> > had to go through some recovery steps to get the node back into the
>> > system. The documentation I used was [1].
>> >
>> > The key thing is that everything in the documentation worked without a
>> > problem. The replication of gluster works, and we can easily monitor that
>> > through the heal commands.
>> >
>> > Unfortunately, we are not able to run "gluster volume status", which
>> > hangs for a moment, and in the end we get "Error : Request timed out ".
>> > Most of the log files are clean, except for
>> > /var/log/glusterfs/etc-glusterfs-glusterd.vol.log. See below for some of
>> > the contents
>> >
>> > [2016-06-13 12:57:01.054458] W [socket.c:870:__socket_keepalive]
>> > 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
>> > argument
>> > [2016-06-13 12:57:01.054492] E [socket.c:2966:socket_connect]
>> > 0-management: Failed to set keep-alive: Invalid argument
>> > [2016-06-13 12:57:01.059023] W [socket.c:870:__socket_keepalive]
>> > 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
>> > argument
>> > [2016-06-13 12:57:01.059042] E [socket.c:2966:socket_connect]
>> > 0-management: Failed to set keep-alive: Invalid argument
>> >
>> > Any assistance on this would be much appreciated.
>> >
>> > [1]
>> https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html#Replacing_a_Host_Machine_with_the_Same_Hostname
>> >
>> > --
>> > Arif Ali
>> >
>> > IRC: arif-ali at freenode
>> > LinkedIn: http://uk.linkedin.com/in/arifali
>> >
>> >

Re: [Gluster-users] issues recovering machine in gluster

2016-06-13 Thread Arif Ali
Hi Atin,

I have sent the tar file of logs in a PM

The version of gluster that we have been using is:

# rpm -qa | grep gluster
glusterfs-api-3.7.11-1.el7.x86_64
glusterfs-geo-replication-3.7.11-1.el7.x86_64
glusterfs-libs-3.7.11-1.el7.x86_64
glusterfs-client-xlators-3.7.11-1.el7.x86_64
glusterfs-fuse-3.7.11-1.el7.x86_64
glusterfs-server-3.7.11-1.el7.x86_64
glusterfs-3.7.11-1.el7.x86_64
glusterfs-cli-3.7.11-1.el7.x86_64

--
Arif Ali

IRC: arif-ali at freenode
LinkedIn: http://uk.linkedin.com/in/arifali

On 13 June 2016 at 15:46, Atin Mukherjee wrote:

> Please send us the glusterd log file along with cmd_history.log from all
> 6 nodes. The logs you mentioned in the thread are not relevant for
> debugging the issue. Which gluster version are you using?
>
> ~Atin
>
> On 06/13/2016 06:49 PM, Arif Ali wrote:
> > Hi all,
> >
> > Hopefully, someone can help
> >
> > We have a 6 node gluster setup, and have successfully got the gluster
> > system up and running, and had no issues with the initial install.
> >
> > For other reasons, we had to re-provision the nodes, and therefore we
> > had to go through some recovery steps to get the node back into the
> > system. The documentation I used was [1].
> >
> > The key thing is that everything in the documentation worked without a
> > problem. The replication of gluster works, and we can easily monitor that
> > through the heal commands.
> >
> > Unfortunately, we are not able to run "gluster volume status", which
> > hangs for a moment, and in the end we get "Error : Request timed out ".
> > Most of the log files are clean, except for
> > /var/log/glusterfs/etc-glusterfs-glusterd.vol.log. See below for some of
> > the contents
> >
> > [2016-06-13 12:57:01.054458] W [socket.c:870:__socket_keepalive]
> > 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> > argument
> > [2016-06-13 12:57:01.054492] E [socket.c:2966:socket_connect]
> > 0-management: Failed to set keep-alive: Invalid argument
> > [2016-06-13 12:57:01.059023] W [socket.c:870:__socket_keepalive]
> > 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> > argument
> > [2016-06-13 12:57:01.059042] E [socket.c:2966:socket_connect]
> > 0-management: Failed to set keep-alive: Invalid argument
> >
> > Any assistance on this would be much appreciated.
> >
> > [1]
> https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html#Replacing_a_Host_Machine_with_the_Same_Hostname
> >
> > --
> > Arif Ali
> >
> > IRC: arif-ali at freenode
> > LinkedIn: http://uk.linkedin.com/in/arifali
> >
> >

Re: [Gluster-users] issues recovering machine in gluster

2016-06-13 Thread Atin Mukherjee
Please send us the glusterd log file along with cmd_history.log from all 6
nodes. The logs you mentioned in the thread are not relevant for debugging
the issue. Which gluster version are you using?
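
(Something along these lines on each node would do; the archive name is
just an example, and the paths assume the default log directory:)

# bundle the two requested logs from one node
tar czf /tmp/gluster-logs-$(hostname -s).tgz \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log \
    /var/log/glusterfs/cmd_history.log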

~Atin

On 06/13/2016 06:49 PM, Arif Ali wrote:
> Hi all,
> 
> Hopefully, someone can help
> 
> We have a 6 node gluster setup, and have successfully got the gluster
> system up and running, and had no issues with the initial install.
> 
> For other reasons, we had to re-provision the nodes, and therefore we
> had to go through some recovery steps to get the node back into the
> system. The documentation I used was [1].
> 
> The key thing is that everything in the documentation worked without a
> problem. The replication of gluster works, and we can easily monitor that
> through the heal commands.
> 
> Unfortunately, we are not able to run "gluster volume status", which
> hangs for a moment, and in the end we get "Error : Request timed out ".
> Most of the log files are clean, except for
> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log. See below for some of
> the contents
> 
> [2016-06-13 12:57:01.054458] W [socket.c:870:__socket_keepalive]
> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> argument
> [2016-06-13 12:57:01.054492] E [socket.c:2966:socket_connect]
> 0-management: Failed to set keep-alive: Invalid argument
> [2016-06-13 12:57:01.059023] W [socket.c:870:__socket_keepalive]
> 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid
> argument
> [2016-06-13 12:57:01.059042] E [socket.c:2966:socket_connect]
> 0-management: Failed to set keep-alive: Invalid argument
> 
> Any assistance on this would be much appreciated.
> 
> [1] 
> https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html#Replacing_a_Host_Machine_with_the_Same_Hostname
> 
> --
> Arif Ali
> 
> IRC: arif-ali at freenode
> LinkedIn: http://uk.linkedin.com/in/arifali
> 
> 


[Gluster-users] issues recovering machine in gluster

2016-06-13 Thread Arif Ali
Hi all,

Hopefully, someone can help

We have a 6 node gluster setup, and have successfully got the gluster
system up and running, and had no issues with the initial install.

For other reasons, we had to re-provision the nodes, and therefore we had
to go through some recovery steps to get the node back into the system. The
documentation I used was [1].

The key thing is that everything in the documentation worked without a
problem. The replication of gluster works, and we can easily monitor that
through the heal commands.

Unfortunately, we are not able to run "gluster volume status", which hangs
for a moment, and in the end we get "Error : Request timed out ". Most of
the log files are clean, except for
/var/log/glusterfs/etc-glusterfs-glusterd.vol.log. See below for some of
the contents

[2016-06-13 12:57:01.054458] W [socket.c:870:__socket_keepalive] 0-socket:
failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid argument
[2016-06-13 12:57:01.054492] E [socket.c:2966:socket_connect] 0-management:
Failed to set keep-alive: Invalid argument
[2016-06-13 12:57:01.059023] W [socket.c:870:__socket_keepalive] 0-socket:
failed to set TCP_USER_TIMEOUT -1000 on socket 45, Invalid argument
[2016-06-13 12:57:01.059042] E [socket.c:2966:socket_connect] 0-management:
Failed to set keep-alive: Invalid argument
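
For reference, the checks involved boil down to roughly the following;
'gluster peer status' is listed only as an extra sanity check one can run:

# this is the command that hangs and eventually times out
gluster volume status

# peer membership as seen from this node
gluster peer status

# watch the management log while the command runs
tail -f /var/log/glusterfs/etc-glusterfs-glusterd.vol.log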

Any assistance on this would be much appreciated.

[1]
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Replacing_Hosts.html#Replacing_a_Host_Machine_with_the_Same_Hostname

--
Arif Ali

IRC: arif-ali at freenode
LinkedIn: http://uk.linkedin.com/in/arifali