Re: replace gluster node

2018-08-26 Thread Tim Dudgeon

So I dug a bit deeper.
After the procedure I described in the previous post, I looked at what 
the status was.


On one of the two gluster nodes/pods that was still working I see this:


# gluster pool list
UUID                    Hostname     State
6075f5a2-ce2f-4d4d-92a2-6850620c636e    10.0.0.24    Connected
ee33c338-c057-416d-81a4-c8d103570f18    10.0.0.39  Disconnected
48571acb-9a4b-4f4d-81bf-30291b14513e    localhost    Connected
The first is the other good node, the second is the failed node that no 
longer exists.

From the other good node the situation is similar.

From the new node I see this:


# gluster pool list
UUID                    Hostname     State
95e93b7d-9e96-4d76-925f-3f8aaf289ba2    localhost    Connected 
So clearly the new node is alive and gluster is running, but it has not 
joined the storage pool.
On the node itself the volumes are present as devices (/dev/vdb, 
/dev/vdc, /dev/vdd) but they are not mounted.
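
For reference, the quick way to see that from the new node's gluster pod is 
something like this (the device names are just what this environment uses):

gluster peer status      # currently lists no peers, matching the pool list above
lsblk                    # vdb/vdc/vdd are present
mount | grep 'vd[bcd]'   # nothing mounted from them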


So how best to rectify this situation?
Should this be done with gluster or with heketi?

Tim


Re: replace gluster node

2018-08-25 Thread Tim Dudgeon

Not having any joy with replacing the broken glusterfs node.

What we did was:

1. Delete the broken gluster node from the cluster, and remove it from 
the inventory file


2. Create a new node to replace it. Add it to the [new_nodes] section of 
the inventory and run the playbooks/byo/openshift-node/scaleup.yml 
playbook. At this stage it is not added to the [glusterfs] section of 
the inventory. The node is now part of the cluster. Move it from the 
[new_nodes] section of the inventory to the [nodes] section.


3. Add the new node to the  [glusterfs] section of the inventory. At 
this stage we have the 2 functioning gluster nodes with volumes 
containing data, and one new node with unformatted volumes.


4. Edit the [OSEv3:vars] section and add these 3 properties:
openshift_storage_glusterfs_wipe = False
openshift_storage_glusterfs_is_missing = False
openshift_storage_glusterfs_heketi_is_missing = False

5. Run the playbooks/byo/openshift-glusterfs/config.yml playbook. This 
fails with the following error:


TASK [openshift_storage_glusterfs : Load heketi topology] 


Saturday 25 August 2018  12:04:37 + (0:00:02.073) 0:26:39.480 ***
fatal: [orn-master.openstacklocal]: FAILED! => {"changed": true, 
"cmd": ["oc", 
"--config=/tmp/openshift-glusterfs-ansible-2Zl8Vv/admin.kubeconfig", 
"rsh", "--namespace=glusterfs", "heketi-storage-2-jvfhn", 
"heketi-cli", "-s", "http://localhost:8080;, "--user", "admin", 
"--secret", "sjGJ1Gix0Nf9GEaXynTSngMwi6D/fHtEEWxyZCSlVY8=", 
"topology", "load", 
"--json=/tmp/openshift-glusterfs-ansible-2Zl8Vv/topology.json", 
"2>&1"], "delta": "0:00:05.354851", "end": "2018-08-25 
12:10:10.128102", "failed_when_result": true, "rc": 0, "start": 
"2018-08-25 12:10:04.773251", "stderr": "", "stderr_lines": [], 
"stdout": "\tFound node orn-gluster-storage-001.openstacklocal on 
cluster de03021c7b9d5f6a99d403a7a369d3e1\n\t\tFound device 
/dev/vdb\n\t\tFound device /dev/vdc\n\t\tFound device 
/dev/vdd\n\tFound node orn-gluster-storage-002.openstacklocal on 
cluster de03021c7b9d5f6a99d403a7a369d3e1\n\t\tFound device 
/dev/vdb\n\t\tFound device /dev/vdc\n\t\tFound device 
/dev/vdd\n\tCreating node orn-gluster-storage-003.openstacklocal ... 
Unable to create node: Unable to execute command on 
glusterfs-storage-k7lp4: peer probe: failed: Probe returned with 
Transport endpoint is not connected", "stdout_lines": ["\tFound node 
orn-gluster-storage-001.openstacklocal on cluster 
de03021c7b9d5f6a99d403a7a369d3e1", "\t\tFound device /dev/vdb", 
"\t\tFound device /dev/vdc", "\t\tFound device /dev/vdd", "\tFound 
node orn-gluster-storage-002.openstacklocal on cluster 
de03021c7b9d5f6a99d403a7a369d3e1", "\t\tFound device /dev/vdb", 
"\t\tFound device /dev/vdc", "\t\tFound device /dev/vdd", "\tCreating 
node orn-gluster-storage-003.openstacklocal ... Unable to create node: 
Unable to execute command on glusterfs-storage-k7lp4: peer probe: 
failed: Probe returned with Transport endpoint is not connected"]}


Likewise, if you oc rsh into the heketi pod and run heketi-cli by hand you 
get the same error:


heketi-cli node add --zone=1 --cluster=$CLUSTER_ID 
--management-host-name=orn-gluster-storage-003.openstacklocal 
--storage-host-name=10.0.0.26
Error: Unable to execute command on glusterfs-storage-k7lp4: peer 
probe: failed: Probe returned with Transport endpoint is not connected
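
The underlying peer state can also be checked by hand from one of the working 
gluster pods with something like this (pod name and IP as in the output above):

oc rsh --namespace=glusterfs glusterfs-storage-k7lp4
gluster peer status
gluster peer probe 10.0.0.26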


Any thoughts how to repair this?


On 24/08/18 13:39, Walters, Todd wrote:

Tim,

Try deleting all the pods, the glusterfs pods and the heketi pod, one at a 
time. I’ve had this work for me: the pods came back up and heketi was OK.

You can also try restarting glusterd from a terminal in each glusterfs pod. 
That’s worked for me to get out of heketi db issues.
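
Roughly something like this, using whatever your actual pod names are:

oc delete pod -n glusterfs <glusterfs pod>   # one at a time, wait for it to come back Ready
oc rsh -n glusterfs <glusterfs pod>
systemctl restart glusterd                   # run inside the pod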

Other than that I don’t have any other ideas. I’ve not found good information 
on how to resolve or troubleshoot issues like this.

Thanks,
Todd




Re: replace gluster node

2018-08-24 Thread Tim Dudgeon

Todd,

Thanks for that. It seems along the lines of what I need.

The catch is that I have an additional problem: the heketi pod is not 
starting because of a messed-up database configuration.
These two problems happened independently, but in the same OpenShift 
environment.

This means I'm unable to run heketi-cli until that is fixed.
Could I modify the heketi database configuration as described in the 
troubleshooting guide [1] so that it only knows about the two good gluster 
nodes, and then add the third one back?


Any thoughts?

Tim

[1] https://github.com/heketi/heketi/blob/master/docs/troubleshooting.md



Re: replace gluster node

2018-08-23 Thread Walters, Todd

Tim,

I have had this issue with a 3-node cluster. I created a new node with new 
devices, ran scaleup and the gluster playbook with some changes, then ran 
heketi-cli commands to add the new node and remove the old one.

For your other question, I’ve restarted all the glusterfs pods and the heketi 
pod and resolved that issue before. I guess you can restart glusterd in each pod too?

Here’s a doc I wrote on node replacement. I’m not sure if this is the proper 
procedure, but it works, and I wasn’t able to find any decent solution in the 
docs.

# - Replacing a Failed Node  #

Disable Node to simulate failure
Get node id with heketi-cli node list or topology info (example below)

heketi-cli node disable fb344a2ea889c7e25a772e747c2a -s 
http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
Node fb344a2ea889c7e25a772e747c2a is now offline
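
To get the node id in the first place, something along these lines works from 
inside the heketi pod (same flags as above):

heketi-cli node list -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
heketi-cli topology info -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"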

Stop Node in AWS Console
Scale up another node (4) for Gluster via Terraform
Run scaleup_node.yml playbook
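
For reference, the stock openshift-ansible equivalent of that scale-up step 
(as used elsewhere in this thread) is to add the host to [new_nodes] in the 
inventory and run roughly:

ansible-playbook -i <inventory file> playbooks/byo/openshift-node/scaleup.yml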

Add New Node and Device

heketi-cli node add --zone=1 --cluster=441248c1b2f032a93aca4a4e03648b28 
--management-host-name=ip-new-node.ec2.internal --storage-host-name=newnodeIP  
-s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
heketi-cli device add --name /dev/xvdc --node 8973b41d8a4e437bd8b36d7df1a93f06 
-s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
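
The cluster id used above comes from something like:

heketi-cli cluster list -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"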


Run deploy_gluster playbook, with the following changes in OSEv3

-   openshift_storage_glusterfs_wipe: False
-   openshift_storage_glusterfs_is_missing: False
-   openshift_storage_glusterfs_heketi_is_missing: False
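
With the stock playbooks those three variables go in [OSEv3:vars] and the 
deploy step is roughly:

ansible-playbook -i <inventory file> playbooks/byo/openshift-glusterfs/config.yml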

Verify topology
rsh into the heketi pod
run heketi-exports (a file I created with export commands; sketch below)
get the old and new node info (id)
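
A sketch of what that heketi-exports file contains, assuming the heketi pod 
exposes the admin key as HEKETI_ADMIN_KEY (heketi-cli picks these up, so the 
-s / --user / --secret flags can then be dropped):

export HEKETI_CLI_SERVER=http://localhost:8080
export HEKETI_CLI_USER=admin
export HEKETI_CLI_KEY="$HEKETI_ADMIN_KEY"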

Remove Node

sh-4.4# heketi-cli node remove fb344a2ea889c7e25a772e747c2a -s 
http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
Node fb344a2ea889c7e25a772e747c2a is now removed


Remove All Devices (check the topology)

sh-4.4# heketi-cli device delete ea85942eaec73cb666c4e3dcec8b3702 -s 
http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
Device ea85942eaec73cb666c4e3dcec8b3702 deleted


Delete the Node

sh-4.4# heketi-cli node delete fb344a2ea889c7e25a772e747c2a -s 
http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
Node fb344a2ea889c7e25a772e747c2a deleted


Verify New Topology

$ heketi-cli topology info
make sure the new node and device are listed.
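
The gluster side can be sanity checked too; from any of the glusterfs pods the 
new peer should show up as Connected:

oc rsh -n glusterfs <any glusterfs-storage pod>
gluster pool list
gluster peer status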


Thanks,

Todd

# ---

Check that any existing PVCs are still accessible.
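
A minimal check, assuming the default glusterfs-storage storage class name:

oc get pvc --all-namespaces | grep glusterfs-storage
oc rsh <a pod that mounts one of the volumes>   # hypothetical, just to confirm the data is readable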

Date: Thu, 23 Aug 2018 15:40:29 +0100
From: Tim Dudgeon
Subject: Replacing failed gluster node

I have a 3 node containerised glusterfs setup, and one of the nodes has
just died.
I believe I can recover the disks that were used for the gluster storage.
What is the best approach to replacing that node with a new one?
Can I just create a new node with empty disks mounted and use the
scaleup.yml playbook and [new_nodes] section, or should I be creating a
node that re-uses the existing drives?

Tim


