Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

Ian Halliday Fri, 06 Apr 2018 06:19:54 -0700

Raghavendra,

Thanks! I'll get you this info within the next few days and will file a bug report at the same time.

For what it's worth, we were able to reproduce the issue on a completely new cluster running 3.13. The IO pattern that most easily causes it to fail is a VM image format with XFS. Formatting VMs with Ext4 will create the additional shard files, but the GFIDs will usually match. I'm not sure if there are supposed to be 2 identical shard filenames, with one being empty, but they don't seem to cause VMs to pause or fail when the GFID matches.

Both of these clusters are pure SSD (one replica 3 arbiter 1, the other replica 3). I haven't seen any issues with our non-SSD clusters yet, but they aren't pushed as hard.


Ian

------ Original Message ------
From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Ian Halliday" <ihalli...@ndevix.com>

Cc: "Krutika Dhananjay" <kdhan...@redhat.com>; "gluster-user" <gluster-users@gluster.org>; "Nithya Balachandran" <nbala...@redhat.com>

Sent: 4/5/2018 10:39:47 PM

Subject: Re: Re[2]: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

Sorry for the delay, Ian :).
This looks to be a genuine issue which requires some effort in fixing it. Can you file a bug? I need the following information attached to the bug:
* Client and brick logs. If you can reproduce the issue, please set diagnostics.client-log-level and diagnostics.brick-log-level to TRACE. If you cannot reproduce the issue or if you cannot accommodate such big logs, please set the log-level to DEBUG.
* If possible, a simple reproducer. A simple script or steps are appreciated.
* strace of the VM (to find out the I/O pattern). If possible, a dump of the traffic between the kernel and glusterfs. This can be captured by mounting glusterfs using the --dump-fuse option.
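As a rough illustration of the first item, the two log-level options named above can be set with `gluster volume set`. The sketch below only builds and prints the commands rather than running them. The helper name `log_level_commands` is hypothetical, and the volume name is taken from the `gluster volume info` output elsewhere in this thread.

```python
import shlex

# Option names as given in the request above.
LOG_OPTIONS = ("diagnostics.client-log-level", "diagnostics.brick-log-level")

def log_level_commands(volume, level="TRACE"):
    """Build one `gluster volume set` command per log-level option.

    Hypothetical helper for illustration; it does not execute anything.
    """
    return [["gluster", "volume", "set", volume, opt, level] for opt in LOG_OPTIONS]

# Print the commands so they can be reviewed before being run by hand.
for cmd in log_level_commands("ovirt-350-zone1"):
    print(shlex.join(cmd))
```

Passing `level="DEBUG"` gives the lower-volume alternative mentioned above.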
Note that the logs you've posted here capture the scenario _after_ the shard file has gone into a bad state. But I need information on what led to that situation. So, please start collecting this diagnostic information as early as you can.
regards,
Raghavendra
On Tue, Apr 3, 2018 at 7:52 AM, Ian Halliday <ihalli...@ndevix.com> wrote:
Raghavendra,

Sorry for the late follow up. I have some more data on the issue.
The issue tends to happen when the shards are created. The easiest time to reproduce this is during an initial VM disk format. This is a log from a test VM that was launched, and then partitioned and formatted with LVM / XFS:
[2018-04-03 02:05:00.838440] W [MSGID: 109048] [dht-common.c:9732:dht_rmdir_cached_lookup_cbk] 0-ovirt-350-zone1-dht: /489c6fb7-fe61-4407-8160-35c0aac40c85/images/_remove_me_9a0660e1-bd86-47ea-8e09-865c14f11f26/e2645bd1-a7f3-4cbd-9036-3d3cbc7204cd.meta found on cached subvol ovirt-350-zone1-replicate-5
[2018-04-03 02:07:57.967489] I [MSGID: 109070] [dht-common.c:2796:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: Lookup of /.shard/927c6620-848b-4064-8c88-68a332b645c2.7 on ovirt-350-zone1-replicate-3 (following linkfile) failed ,gfid = 00000000-0000-0000-0000-000000000000 [No such file or directory]
[2018-04-03 02:07:57.974815] I [MSGID: 109069] [dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk] 0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for /.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.979851] W [MSGID: 109009] [dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data file on ovirt-350-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980716] W [MSGID: 109009] [dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume ovirt-350-zone1-replicate-3, gfid local = b1e3f299-32ff-497e-918b-090e957090f6, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980763] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard: Lookup on shard 3 failed. Base file gfid = 927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.983016] I [MSGID: 109069] [dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk] 0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for /.shard/927c6620-848b-4064-8c88-68a332b645c2.7
[2018-04-03 02:07:57.988761] W [MSGID: 109009] [dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume ovirt-350-zone1-replicate-3, gfid local = b1e3f299-32ff-497e-918b-090e957090f6, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.988844] W [MSGID: 109009] [dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid different on data file on ovirt-350-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989748] W [MSGID: 109009] [dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid differs on subvolume ovirt-350-zone1-replicate-3, gfid local = efbb9be5-0744-4883-8f3e-e8f7ce8d7741, gfid node = 955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989827] I [MSGID: 109069] [dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk] 0-ovirt-350-zone1-dht: Returned with op_ret -1 and op_errno 2 for /.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.989832] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard: Lookup on shard 7 failed. Base file gfid = 927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
The message "W [MSGID: 109009] [dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data file on ovirt-350-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb " repeated 2 times between [2018-04-03 02:07:57.979851] and [2018-04-03 02:07:57.995739]
[2018-04-03 02:07:57.996644] W [MSGID: 109009] [dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume ovirt-350-zone1-replicate-3, gfid local = 0a701104-e9a2-44c0-8181-4a9a6edecf9f, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.996761] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard: Lookup on shard 3 failed. Base file gfid = 927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.998986] W [MSGID: 109009] [dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data file on ovirt-350-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999857] W [MSGID: 109009] [dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: /.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume ovirt-350-zone1-replicate-3, gfid local = 0a701104-e9a2-44c0-8181-4a9a6edecf9f, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999899] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard: Lookup on shard 3 failed. Base file gfid = 927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.999942] W [fuse-bridge.c:896:fuse_attr_cbk] 0-glusterfs-fuse: 22338: FSTAT() /489c6fb7-fe61-4407-8160-35c0aac40c85/images/a717e25c-f108-4367-9d28-9235bd432bb7/5a8e541e-8883-4dec-8afd-aa29f38ef502 => -1 (Stale file handle)
[2018-04-03 02:07:57.987941] I [MSGID: 109069] [dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk] 0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for /.shard/927c6620-848b-4064-8c88-68a332b645c2.3
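For anyone sifting through a large client log for the same pattern, the MSGID 109009 warnings can be pulled out with a small script. This is a minimal sketch that assumes the message wording shown in the log above. The function name `gfid_mismatches` is hypothetical, and the sample line is copied from this log.

```python
import re
from collections import defaultdict

# Matches the two DHT warning variants seen in the log above. Whitespace is
# matched loosely because archived copies of a log can lose or gain spaces.
MISMATCH_RE = re.compile(
    r"(?P<path>/\.shard/[0-9a-f-]+\.\d+):\s*"
    r"gfid\s+(?:different\s+on\s+data\s*file\s+on|differs\s+on\s*subvolume)\s*"
    r"(?P<subvol>\S+?),\s*"
    r"gfid\s+local\s*=\s*(?P<local>[0-9a-f-]{36}),\s*"
    r"gfid\s+node\s*=\s*(?P<node>[0-9a-f-]{36})"
)

def gfid_mismatches(log_text):
    """Return {shard path: set of (subvolume, local gfid, node gfid)}."""
    found = defaultdict(set)
    for m in MISMATCH_RE.finditer(log_text):
        found[m["path"]].add((m["subvol"], m["local"], m["node"]))
    return dict(found)

# One warning copied from the log in this message.
sample = (
    "[2018-04-03 02:07:57.980716] W [MSGID: 109009] "
    "[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: "
    "/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on "
    "subvolume ovirt-350-zone1-replicate-3, gfid local = "
    "b1e3f299-32ff-497e-918b-090e957090f6, gfid node = "
    "55f86aa0-e7a0-4075-b46b-a11f8bdbbceb\n"
)

for path, entries in gfid_mismatches(sample).items():
    for subvol, local, node in sorted(entries):
        print(path, subvol, "local:", local, "node:", node)
```

Collecting the results in a set per shard path keeps the repeated warnings from drowning out the distinct gfid pairs.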
Duplicate shards are created. Output from one of the gluster nodes:

# find -name 927c6620-848b-4064-8c88-68a332b645c2.*
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
[root@n1 gluster]# getfattr -d -m . -e hex ./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d3638613333326236343563322e3139
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300
[root@n1 gluster]# getfattr -d -m . -e hex ./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d3638613333326236343563322e3139
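The hex values in the getfattr output above are easier to compare with the client log once decoded. A minimal sketch, assuming `trusted.gfid` holds the 16 raw bytes of a UUID and `trusted.glusterfs.dht.linkto` holds a NUL-terminated subvolume name; the helper names are hypothetical.

```python
import uuid

def _strip_0x(hex_value):
    """Drop the 0x prefix that `getfattr -e hex` puts on every value."""
    return hex_value[2:] if hex_value.startswith("0x") else hex_value

def decode_gfid(hex_value):
    """Format a trusted.gfid value in the dashed form used in the logs."""
    return str(uuid.UUID(hex=_strip_0x(hex_value)))

def decode_text(hex_value):
    """Decode a string-valued xattr, dropping the trailing NUL byte."""
    return bytes.fromhex(_strip_0x(hex_value)).rstrip(b"\x00").decode("utf-8")

# Values copied from the getfattr output above.
print(decode_gfid("0x46083184a0e5468e89e6cc1db0bfc63b"))
# -> 46083184-a0e5-468e-89e6-cc1db0bfc63b
print(decode_text("0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300"))
# -> ovirt-350-zone1-replicate-3
```

The decoded linkto value names the replicate subvolume that the link file points at.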


In the above example, the shard on Brick 1 is the bad one.
At this point, the VM will pause with an unknown storage error and will not boot until the offending shards are removed.
# gluster volume info
Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Bricks:
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
Options Reconfigured:
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.client-io-threads: off
server.allow-insecure: on
client.event-threads: 8
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 16
features.shard-block-size: 5GB
features.shard: on
transport.address-family: inet
nfs.disable: yes

Any suggestions?


-- Ian


------ Original Message ------
From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Krutika Dhananjay" <kdhan...@redhat.com>
Cc: "Ian Halliday" <ihalli...@ndevix.com>; "gluster-user" <gluster-users@gluster.org>; "Nithya Balachandran" <nbala...@redhat.com>
Sent: 3/26/2018 2:37:21 AM
Subject: Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids
Ian,
Do you have a reproducer for this bug? If not a specific one, a general outline of what operations were done on the file will help.
regards,
Raghavendra
On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay <kdhan...@redhat.com> wrote:
The gfid mismatch here is between the shard and its "link-to" file, the creation of which happens at a layer below that of the shard translator on the stack.
Adding DHT devs to take a look.
Thanks Krutika. I assume shard doesn't do any dentry operations like rename, link, unlink on the path of the file (not the gfid handle based path) internally while managing shards. Can you confirm? If it does these operations, what fops does it do?
@Ian,

I can suggest the following way to fix the problem:
* Since one of the files listed is a DHT linkto file, I am assuming there is only one shard of the file. If not, please list out the gfids of the other shards and don't proceed with the healing procedure.
* If the gfids of all shards happen to be the same and only the linkto has a different gfid, please proceed to step 3. Otherwise abort the healing procedure.
* If cluster.lookup-optimize is set to true, abort the healing procedure.
* Delete the linkto file (the file with permissions ---------T and xattr trusted.glusterfs.dht.linkto) and do a lookup on the file from the mount point after turning off readdirplus [1].
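Before deleting anything in the last step, it may help to confirm that the candidate really is a linkto file. The sketch below encodes the traits visible in this thread: a regular file whose only mode bit is the sticky bit (`---------T`), zero size, and the linkto xattr as printed by getfattr. The zero-size check and the function names are assumptions made for illustration, not a documented GlusterFS rule.

```python
import os
import stat

# xattr name as printed by getfattr in this thread.
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"

def looks_like_linkto(st_mode, st_size, xattr_names):
    """True if the stat fields and xattr names match a DHT linkto file.

    Assumption: linkto files are empty, as in the ls output in this thread.
    """
    return (
        stat.S_ISREG(st_mode)
        and stat.S_IMODE(st_mode) == stat.S_ISVTX  # only the sticky bit set
        and st_size == 0
        and LINKTO_XATTR in xattr_names
    )

def check_brick_path(path):
    """Run the check on a brick path (Linux only; trusted.* needs root)."""
    st = os.lstat(path)
    return looks_like_linkto(st.st_mode, st.st_size, os.listxattr(path))

# Mode and size values mirroring the two copies listed in this thread.
print(looks_like_linkto(stat.S_IFREG | 0o1000, 0, ["trusted.gfid", LINKTO_XATTR]))
print(looks_like_linkto(stat.S_IFREG | 0o660, 4080218931, ["trusted.gfid"]))
```

A check like this only guards against removing the data file by mistake; it does not replace the gfid comparison in the earlier steps.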
As to how we ended up in this situation, can you explain to me what the I/O pattern on this file is? For example, are there lots of entry operations like rename, link, unlink etc. on the file? There have been known races in rename/lookup-heal-creating-linkto where the linkto and data file have different gfids. [2] fixes some of these cases.
[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
[2] https://review.gluster.org/#/c/19547/
regards,
Raghavendra
-Krutika
On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday <ihalli...@ndevix.com> wrote:
Hello all,
We are having a rather interesting problem with one of our VM storage systems. The GlusterFS client is throwing errors relating to GFID mismatches. We traced this down to multiple shards being present on the gluster nodes, with different gfids.
Hypervisor gluster mount log:
[2018-03-25 18:54:19.261733] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard: Lookup on shard 7 failed. Base file gfid = 87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
The message "W [MSGID: 109009] [dht-common.c:2162:dht_lookup_linkfile_cbk] 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid different on data file on ovirt-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between [2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
[2018-03-25 18:54:19.264349] W [MSGID: 109009] [dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on subvolume ovirt-zone1-replicate-3, gfid local = fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node = 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56
On the storage nodes, we found this:
[root@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
[root@n1 gluster]# ls -lh ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
---------T. 2 root root 0 Mar 25 13:55 ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
[root@n1 gluster]# ls -lh ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
-rw-rw----. 2 root root 3.8G Mar 25 13:55 ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
[root@n1 gluster]# getfattr -d -m . -e hex ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
# file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300
[root@n1 gluster]# getfattr -d -m . -e hex ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
# file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x020000000000000059914190000ce672
trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
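The comparison done by eye above can be scripted when many shards are involved. This is a minimal sketch that parses saved `getfattr -d -m . -e hex` output and reports shard names whose copies carry different `trusted.gfid` values. The function names are hypothetical, and the sample is trimmed from the output above.

```python
import os

def parse_getfattr(dump):
    """Parse getfattr dump text into {file path: {xattr name: value}}."""
    files, current = {}, None
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("# file:"):
            current = files.setdefault(line[len("# file:"):].strip(), {})
        elif "=" in line and current is not None:
            name, _, value = line.partition("=")
            current[name] = value
    return files

def gfid_report(files):
    """Return {shard name: [(path, kind, gfid)]} for names with differing gfids."""
    by_name = {}
    for path, attrs in files.items():
        kind = "linkto" if "trusted.glusterfs.dht.linkto" in attrs else "data"
        entry = (path, kind, attrs.get("trusted.gfid"))
        by_name.setdefault(os.path.basename(path), []).append(entry)
    return {
        name: copies
        for name, copies in by_name.items()
        if len({gfid for _, _, gfid in copies}) > 1
    }

# Trimmed copy of the two getfattr dumps above.
sample = """\
# file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300
# file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
"""

for name, copies in gfid_report(parse_getfattr(sample)).items():
    for path, kind, gfid in copies:
        print(name, kind, gfid, path)
```

Labelling each copy as linkto or data makes it clear which side of the mismatch is the link file.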
I'm wondering how they got created in the first place, and if anyone has any insight on how to fix it?
Storage nodes:
[root@n1 gluster]# gluster --version
glusterfs 4.0.0

[root@n1 gluster]# gluster volume info

Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Bricks:
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
Options Reconfigured:
cluster.min-free-disk: 50GB
performance.strict-write-ordering: off
performance.strict-o-direct: off
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
performance.cache-size: 1GB
features.shard: on
features.shard-block-size: 5GB
server.event-threads: 8
server.outstanding-rpc-limit: 128
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.flush-behind: off
performance.write-behind-window-size: 8MB
client.event-threads: 8
server.allow-insecure: on


Client version:
[root@kvm573 ~]# gluster --version
glusterfs 3.12.5


Thanks!

- Ian


_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users
