Hi,

So it seems some of the files in the volume have mismatching gfids. I see the following logs from 15th June, ~8pm EDT:
<snip>
... ...
[2018-06-16 04:00:10.264690] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-engine-replicate-0: Gfid mismatch detected for <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>, 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4411: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008] [afr-self-heal-common.c:212:afr_gfid_split_brain_source] 0-engine-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-06-16 04:00:11.522632] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-engine-replicate-0: Gfid mismatch detected for <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>, 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4493: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008] [afr-self-heal-common.c:212:afr_gfid_split_brain_source] 0-engine-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-engine-replicate-0: Gfid mismatch detected for <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>, 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4575: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4657: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4739: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4821: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4906: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4987: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
... ...
</snip>

Adding Ravi, who works on the replicate component, to help resolve the mismatches. (A rough sketch of the commands for cross-checking the gfids on each brick is appended at the bottom of this mail, below the quoted thread.)

-Krutika

On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay <kdhan...@redhat.com> wrote:

> Hi,
>
> Sorry, I was out sick on Friday. I am looking into the logs. Will get back
> to you in some time.
>
> -Krutika
>
> On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner <han...@andrewswireless.net> wrote:
>
>> Hi Krutika,
>>
>> Did you need any other logs?
>>
>> Thanks,
>>
>> Hanson
>>
>> On 06/27/2018 02:04 PM, Hanson Turner wrote:
>>
>> Hi Krutika,
>>
>> Looking at the flood of alert emails, it looks like it started at 8:04 PM EDT on Jun 15 2018.
>>
>> From my memory, I think the cluster was working fine until sometime that night. Somewhere between midnight and the next (Saturday) morning, the engine crashed and all VMs stopped.
>>
>> I do have nightly backups that ran every night, using the engine-backup command. Looks like my last valid backup was 2018-06-15.
>>
>> I've included all the logs I think might be of use. Please forgive the use of 7zip, as the raw logs took 50 MB, which is greater than my attachment limit.
>>
>> I think the gist of what happened is that we had a downed node for a period of time. Earlier that day, the node was brought back into service. Later that night or early the next morning, the engine was gone and hopping from node to node.
>>
>> I have tried to mount the engine's HDD file to see if I could fix it. There are a few corrupted partitions, and those are XFS-formatted. Trying to mount gives me errors about needing repair; trying to repair gives me errors about needing something cleaned first. I cannot remember exactly what it was, but it wanted me to run a command ending in -L to clear out the logs. I said no way and have left the engine VM in a powered-down state, as well as the cluster in global maintenance.
>>
>> I can see no sign of the VM booting (i.e. no networking), except for what I've described earlier in the VNC session.
>>
>>
>> Thanks,
>>
>> Hanson
>>
>>
>> On 06/27/2018 12:04 PM, Krutika Dhananjay wrote:
>>
>> Yeah, complete logs would help. Also let me know when you saw this issue - date and approx. time (do specify the timezone as well).
>>
>> -Krutika
>>
>> On Wed, Jun 27, 2018 at 7:00 PM, Hanson Turner <han...@andrewswireless.net> wrote:
>>
>>> # more rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net\:_engine.log
>>> [2018-06-24 07:39:12.161323] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
>>>
>>> # more gluster_bricks-engine-engine.log
>>> [2018-06-24 07:39:14.194222] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
>>> [2018-06-24 19:58:28.608469] E [MSGID: 101063] [event-epoll.c:551:event_dispatch_epoll_handler] 0-epoll: stale fd found on idx=12, gen=1, events=1, slot->gen=3
>>> [2018-06-25 14:24:19.716822] I [addr.c:55:compare_addr_and_update] 0-/gluster_bricks/engine/engine: allowed = "*", received addr = "192.168.0.57"
>>> [2018-06-25 14:24:19.716868] I [MSGID: 115029] [server-handshake.c:793:server_setvolume] 0-engine-server: accepted client from CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0 (version: 4.0.2)
>>> [2018-06-25 14:45:35.061350] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-engine-server: disconnecting connection from CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>>> [2018-06-25 14:45:35.061415] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-engine-server: fd cleanup on /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/images/82cde976-0650-4db9-9487-e2b52ffe25ee/e53806d9-3de5-4b26-aadc-157d745a9e0a
>>> [2018-06-25 14:45:35.062290] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down connection CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>>> [2018-06-25 14:46:34.284195] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-engine-server: disconnecting connection from CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>> [2018-06-25 14:46:34.284546] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down connection CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>>
>>>
>>> # gluster volume info engine
>>>
>>> Volume Name: engine
>>> Type: Replicate
>>> Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Brick2: ovirtnode3.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Brick3: ovirtnode4.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Options Reconfigured:
>>> performance.strict-write-ordering: off
>>> server.event-threads: 4
>>> client.event-threads: 4
>>> features.shard-block-size: 512MB
>>> cluster.granular-entry-heal: enable
>>> performance.strict-o-direct: off
>>> network.ping-timeout: 30
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> user.cifs: off
>>> features.shard: on
>>> cluster.shd-wait-qlength: 10000
>>> cluster.shd-max-threads: 8
>>> cluster.locking-scheme: granular
>>> cluster.data-self-heal-algorithm: full
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> cluster.eager-lock: enable
>>> network.remote-dio: off
>>> performance.low-prio-threads: 32
>>> performance.io-cache: off
>>> performance.read-ahead: off
>>> performance.quick-read: off
>>> transport.address-family: inet
>>> nfs.disable: on
>>> performance.client-io-threads: off
>>>
>>> # gluster --version
>>> glusterfs 3.12.9
>>> Repository revision: git://git.gluster.org/glusterfs.git
>>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>>> It is licensed to you under your choice of the GNU Lesser
>>> General Public License, version 3 or any later version (LGPLv3
>>> or later), or the GNU General Public License, version 2 (GPLv2),
>>> in all cases as published by the Free Software Foundation.
>>>
>>> Let me know if you want logs from further back; I can attach and send them directly to you.
>>>
>>> Thanks,
>>>
>>> Hanson
>>>
>>>
>>> On 06/26/2018 12:30 AM, Krutika Dhananjay wrote:
>>>
>>> Could you share the gluster mount and brick logs? You'll find them under /var/log/glusterfs.
>>> Also, what's the version of gluster you're using?
>>> Also, output of `gluster volume info <ENGINE_VOLNAME>`?
>>>
>>> -Krutika
>>>
>>> On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose <sab...@redhat.com> wrote:
>>>
>>>>
>>>> On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <han...@andrewswireless.net> wrote:
>>>>
>>>>> Hi Benny,
>>>>>
>>>>> Who should I be reaching out to for help with a gluster-based hosted-engine corruption?
>>>>>
>>>>
>>>> Krutika, could you help?
>>>>
>>>>>
>>>>> --== Host 1 status ==--
>>>>>
>>>>> conf_on_shared_storage            : True
>>>>> Status up-to-date                 : True
>>>>> Hostname                          : ovirtnode1.abcxyzdomains.net
>>>>> Host ID                           : 1
>>>>> Engine status                     : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
>>>>> Score                             : 3400
>>>>> stopped                           : False
>>>>> Local maintenance                 : False
>>>>> crc32                             : 92254a68
>>>>> local_conf_timestamp              : 115910
>>>>> Host timestamp                    : 115910
>>>>> Extra metadata (valid at timestamp):
>>>>>     metadata_parse_version=1
>>>>>     metadata_feature_version=1
>>>>>     timestamp=115910 (Mon Jun 18 09:43:20 2018)
>>>>>     host-id=1
>>>>>     score=3400
>>>>>     vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
>>>>>     conf_on_shared_storage=True
>>>>>     maintenance=False
>>>>>     state=GlobalMaintenance
>>>>>     stopped=False
>>>>>
>>>>>
>>>>> When I VNC into my HE, all I get is:
>>>>> Probing EDD (edd=off to disable)... ok
>>>>>
>>>>> So, that's why it's failing the liveliness check... I cannot get the screen on the HE to change short of ctrl-alt-del, which will reboot the HE.
>>>>> I do have backups for the HE that are/were run on a nightly basis.
>>>>>
>>>>> If the cluster was left alone, the HE VM would bounce from machine to machine trying to boot. This is why the cluster is in maintenance mode.
>>>>> One of the nodes was down for a period of time and brought back. Sometime through the night, which is when the automated backup kicks in, the HE started bouncing around. Got nearly 1000 emails.
>>>>>
>>>>> This seems to be the same error (but may not be the same cause) as listed here:
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Hanson
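
PS: For reference, here is a rough sketch of how the gfid mismatch could be double-checked directly on the nodes. This is only an illustration, not the resolution procedure (Ravi should confirm the actual fix), and the brick path below is assumed from the volume info and the FUSE path in the quoted thread, so adjust it to the real layout:

# Check that all three bricks and the self-heal daemons are online
# (the "All the bricks should be up" messages above suggest one was down):
gluster volume status engine

# List entries pending heal, and any entries flagged as split-brain:
gluster volume heal engine info
gluster volume heal engine info split-brain

# On each node, dump the xattrs of the file straight from the brick (as root)
# and compare the trusted.gfid value across the three bricks:
getfattr -d -m . -e hex \
  /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace

The two gfids from the logs (6bbe6097-... on engine-client-2 and ef21a706-... on engine-client-0) should show up as differing trusted.gfid values on the corresponding bricks; if the usual client-to-brick ordering holds, engine-client-0 and engine-client-2 would be the first and third bricks in the volume info (ovirtnode1 and ovirtnode4).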
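Also, while this is being sorted out, it is probably worth leaving things as they are now: the cluster in global maintenance, so the HA agents stop trying to restart the engine VM on different hosts. A minimal sketch with the standard hosted-engine CLI, run on any HA host:

# Confirm the current HA state and that global maintenance is still set:
hosted-engine --vm-status

# (Re)enable global maintenance if needed, and drop it only once the heal has completed:
hosted-engine --set-maintenance --mode=global
hosted-engine --set-maintenance --mode=none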
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/N4ZGCK5Q4VWTC6NMIT6SSTNGPQKFFSHI/