Hi,

So it seems some of the files in the volume have mismatching GFIDs. I see
the following logs from June 15, ~8 PM EDT:

<snip>
...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4411: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid split
barin
[2018-06-16 04:00:11.522632] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4493: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid split
barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4575: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4657: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4739: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4821: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4906: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4987: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
...
...
</snip>
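
While we wait, one way to confirm the mismatch directly on the bricks is to
compare the trusted.gfid xattr of the file on each node (a sketch; run as
root on each node against its own brick - the brick path below is taken from
the volume info further down this thread, adjust if yours differs):

# getfattr -d -m . -e hex /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace

The trusted.gfid values should differ across the bricks and match the two
GFIDs reported in the log above. Also note the "All the bricks should be up
to resolve the gfid split brain" message - all three bricks need to be
online before any resolution is attempted.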

Adding Ravi, who works on the replicate component, to help resolve the
mismatches.

-Krutika


On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay <kdhan...@redhat.com>
wrote:

> Hi,
>
> Sorry, I was out sick on Friday. I am looking into the logs. Will get back
> to you in some time.
>
> -Krutika
>
> On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner <han...@andrewswireless.net
> > wrote:
>
>> Hi Krutika,
>>
>> Did you need any other logs?
>>
>>
>> Thanks,
>>
>> Hanson
>>
>> On 06/27/2018 02:04 PM, Hanson Turner wrote:
>>
>> Hi Krutika,
>>
>> Looking at the flood of alert emails, it looks like it started at 8:04 PM
>> EDT on Jun 15, 2018.
>>
>> From my memory, I think the cluster was working fine until sometime that
>> night. Somewhere between midnight and the next (Saturday) morning, the
>> engine crashed and all VMs stopped.
>>
>> I do have nightly backups, taken with the engine-backup command. It looks
>> like my last valid backup is from 2018-06-15.
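>>
>> (For reference, the nightly job runs something along these lines - a
>> sketch, with the backup file and log names as placeholders:)
>>
>> # engine-backup --mode=backup --scope=all --file=engine-backup-YYYY-MM-DD.bck --log=engine-backup-YYYY-MM-DD.log
>>
>> (So if the engine's disk cannot be recovered, restoring that file onto a
>> fresh engine VM with engine-backup --mode=restore should be an option.)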
>>
>> I've included all the logs I think might be of use. Please forgive the use
>> of 7zip; the raw logs come to 50 MB, which is more than my attachment limit.
>>
>> I think the gist of what happened is that we had a downed node for a period
>> of time. Earlier that day, the node was brought back into service. Later
>> that night or early the next morning, the engine was gone and was hopping
>> from node to node.
>>
>> I have tried to mount the engine's hdd file to see if I could fix it.
>> There are a few corrupted partitions, and those are XFS-formatted. Trying
>> to mount gives me errors about the filesystem needing repair, and trying to
>> repair gives me errors about something needing to be cleaned first. I
>> cannot remember exactly what it was, but it wanted me to run a command
>> ending in -L to clear out the log. I said no way, and have left the engine
>> VM in a powered-down state, as well as the cluster in global maintenance.
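>>
>> (For reference, the repair tool in question would be xfs_repair; a rough
>> sketch of the sequence, with the device path as a placeholder:)
>>
>> # xfs_repair -n /dev/mapper/<engine-partition>   (dry run: report problems, change nothing)
>> # xfs_repair -L /dev/mapper/<engine-partition>   (what it asked for: zero the dirty XFS log, which can lose recent changes)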
>>
>> I can see no sign of the VM booting (i.e. no networking), except for what
>> I've described earlier in the VNC session.
>>
>>
>> Thanks,
>>
>> Hanson
>>
>>
>>
>> On 06/27/2018 12:04 PM, Krutika Dhananjay wrote:
>>
>> Yeah, complete logs would help. Also let me know when you saw this issue -
>> the date and approximate time (please specify the timezone as well).
>>
>> -Krutika
>>
>> On Wed, Jun 27, 2018 at 7:00 PM, Hanson Turner <
>> han...@andrewswireless.net> wrote:
>>
>>> # more rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net\
>>> :_engine.log
>>> [2018-06-24 07:39:12.161323] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk]
>>> 0-glusterfs: No change in volfile,continuing
>>>
>>> # more gluster_bricks-engine-engine.log
>>> [2018-06-24 07:39:14.194222] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk]
>>> 0-glusterfs: No change in volfile,continuing
>>> [2018-06-24 19:58:28.608469] E [MSGID: 101063]
>>> [event-epoll.c:551:event_dispatch_epoll_handler] 0-epoll: stale fd
>>> found on idx=12, gen=1, events=1, slot->gen=3
>>> [2018-06-25 14:24:19.716822] I [addr.c:55:compare_addr_and_update]
>>> 0-/gluster_bricks/engine/engine: allowed = "*", received addr =
>>> "192.168.0.57"
>>> [2018-06-25 14:24:19.716868] I [MSGID: 115029]
>>> [server-handshake.c:793:server_setvolume] 0-engine-server: accepted
>>> client from CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9
>>> 901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0 (version: 4.0.2)
>>> [2018-06-25 14:45:35.061350] I [MSGID: 115036]
>>> [server.c:527:server_rpc_notify] 0-engine-server: disconnecting
>>> connection from CTX_ID:79b9d5b7-0bbb-4d67-87cf
>>> -11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engin
>>> e-client-0-RECON_NO:-0
>>> [2018-06-25 14:45:35.061415] I [MSGID: 115013]
>>> [server-helpers.c:289:do_fd_cleanup] 0-engine-server: fd cleanup on
>>> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/images/82cde976-0650-4
>>> db9-9487-e2b52ffe25ee/e53806d9-3de5-4b26-aadc-157d745a9e0a
>>> [2018-06-25 14:45:35.062290] I [MSGID: 101055]
>>> [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down
>>> connection CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9
>>> 901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>>> [2018-06-25 14:46:34.284195] I [MSGID: 115036]
>>> [server.c:527:server_rpc_notify] 0-engine-server: disconnecting
>>> connection from CTX_ID:13e88614-31e8-4618-9f7f
>>> -067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:eng
>>> ine-client-0-RECON_NO:-0
>>> [2018-06-25 14:46:34.284546] I [MSGID: 101055]
>>> [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down
>>> connection CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2
>>> 615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>>
>>>
>>> # gluster volume info engine
>>>
>>> Volume Name: engine
>>> Type: Replicate
>>> Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Brick2: ovirtnode3.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Brick3: ovirtnode4.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Options Reconfigured:
>>> performance.strict-write-ordering: off
>>> server.event-threads: 4
>>> client.event-threads: 4
>>> features.shard-block-size: 512MB
>>> cluster.granular-entry-heal: enable
>>> performance.strict-o-direct: off
>>> network.ping-timeout: 30
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> user.cifs: off
>>> features.shard: on
>>> cluster.shd-wait-qlength: 10000
>>> cluster.shd-max-threads: 8
>>> cluster.locking-scheme: granular
>>> cluster.data-self-heal-algorithm: full
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> cluster.eager-lock: enable
>>> network.remote-dio: off
>>> performance.low-prio-threads: 32
>>> performance.io-cache: off
>>> performance.read-ahead: off
>>> performance.quick-read: off
>>> transport.address-family: inet
>>> nfs.disable: on
>>> performance.client-io-threads: off
>>>
>>> # gluster --version
>>> glusterfs 3.12.9
>>> Repository revision: git://git.gluster.org/glusterfs.git
>>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>>> It is licensed to you under your choice of the GNU Lesser
>>> General Public License, version 3 or any later version (LGPLv3
>>> or later), or the GNU General Public License, version 2 (GPLv2),
>>> in all cases as published by the Free Software Foundation.
>>>
>>> Let me know if you want logs further back; I can attach and send them
>>> directly to you.
>>>
>>> Thanks,
>>>
>>> Hanson
>>>
>>>
>>>
>>> On 06/26/2018 12:30 AM, Krutika Dhananjay wrote:
>>>
>>> Could you share the gluster mount and brick logs? You'll find them
>>> under /var/log/glusterfs.
>>> Also, what's the version of gluster you're using?
>>> Also, output of `gluster volume info <ENGINE_VOLNAME>`?
>>>
>>> -Krutika
>>>
>>> On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose <sab...@redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <
>>>> han...@andrewswireless.net> wrote:
>>>>
>>>>> Hi Benny,
>>>>>
>>>>> Who should I be reaching out to for help with a gluster based hosted
>>>>> engine corruption?
>>>>>
>>>>
>>>>
>>>> Krutika, could you help?
>>>>
>>>>
>>>>>
>>>>> --== Host 1 status ==--
>>>>>
>>>>> conf_on_shared_storage             : True
>>>>> Status up-to-date                  : True
>>>>> Hostname                           : ovirtnode1.abcxyzdomains.net
>>>>> Host ID                            : 1
>>>>> Engine status                      : {"reason": "failed liveliness
>>>>> check", "health": "bad", "vm": "up", "detail": "Up"}
>>>>> Score                              : 3400
>>>>> stopped                            : False
>>>>> Local maintenance                  : False
>>>>> crc32                              : 92254a68
>>>>> local_conf_timestamp               : 115910
>>>>> Host timestamp                     : 115910
>>>>> Extra metadata (valid at timestamp):
>>>>>     metadata_parse_version=1
>>>>>     metadata_feature_version=1
>>>>>     timestamp=115910 (Mon Jun 18 09:43:20 2018)
>>>>>     host-id=1
>>>>>     score=3400
>>>>>     vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
>>>>>     conf_on_shared_storage=True
>>>>>     maintenance=False
>>>>>     state=GlobalMaintenance
>>>>>     stopped=False
>>>>>
>>>>>
>>>>> When I VNC into my HE, all I get is:
>>>>> Probing EDD (edd=off to disable)... ok
>>>>>
>>>>>
>>>>> So that's why it's failing the liveliness check... I cannot get the
>>>>> screen on the HE to change, short of Ctrl-Alt-Del, which will reboot the
>>>>> HE. I do have backups for the HE that are/were run on a nightly basis.
>>>>>
>>>>> If the cluster was left alone, the HE VM would bounce from machine to
>>>>> machine trying to boot. This is why the cluster is in maintenance mode.
>>>>> One of the nodes was down for a period of time and was brought back;
>>>>> sometime through the night, around when the automated backup kicks in,
>>>>> the HE started bouncing around. I got nearly 1000 emails.
>>>>>
>>>>> This seems to be the same error (but may not be the same cause) as
>>>>> listed here:
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Hanson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/N4ZGCK5Q4VWTC6NMIT6SSTNGPQKFFSHI/
