Could you attach both client and brick logs? Meanwhile I will try these steps out on my machines and see if it is easily recreatable.
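If you haven't changed the log location, they should be under the default log directory (exact file names vary with the mount point and brick path, so treat these as a rough guide):

    /var/log/glusterfs/<mount-point>.log          <- client (fuse mount) log
    /var/log/glusterfs/glustershd.log             <- self-heal daemon log
    /var/log/glusterfs/bricks/<brick-path>.log    <- brick logs, one per brick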
-Krutika

On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <dgoss...@carouselchecks.com> wrote:
> Centos 7 Gluster 3.8.3
>
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
>
> Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
> Following the steps detailed in previous recommendations, I began the
> process of replacing and healing bricks one node at a time:
>
> 1) kill the pid of the brick
> 2) reconfigure the brick from raid6 to raid10
> 3) recreate the directory of the brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
>
> The 1st node worked as expected and took 12 hours to heal 1TB of data.
> Load was a little heavy but nothing shocking.
>
> About an hour after node 1 finished I began the same process on node 2.
> The heal process kicked in as before, and the files in directories
> visible from the mount and in .glusterfs healed in short order. Then it
> began the crawl of .shard, adding those files to the heal count, at
> which point the entire process basically ground to a halt. After 48
> hours, out of 19k shards it has added 5900 to the heal list. Load on
> all 3 machines is negligible. It was suggested to change
> cluster.data-self-heal-algorithm to full and restart the volume, which
> I did. No effect. Tried relaunching the heal, no effect, regardless of
> which node I picked. I started each VM and performed a stat of all
> files from within it, or a full virus scan, and that seemed to cause
> short small spikes in shards added, but not by much. Logs are showing
> no real messages indicating anything is going on. I get hits in the
> brick log on occasion for null lookups, making me think it's not
> really crawling the shards directory but waiting for a shard lookup to
> add it. I'll get the following in the brick log, but not constantly,
> and sometimes multiple entries for the same shard:
>
> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
> type for (null) (LOOKUP)
> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221)
> ==> (Invalid argument) [Invalid argument]
>
> This one repeated about 30 times in a row, then nothing for 10
> minutes, then one hit for one different shard by itself.
>
> How can I determine if the heal is actually running? How can I kill it
> or force a restart? Does the node I start it from determine which
> directory gets crawled to determine heals?
>
> *David Gossage*
> *Carousel Checks Inc. | System Administrator*
> *Office* 708.613.2284
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
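On your questions about whether the heal is actually running: a rough way to check (assuming the 3.8 CLI, with <volname> standing in for your volume name) is to watch whether the pending-entry counts move between runs and whether the self-heal daemon shows as online on all three nodes:

    gluster volume heal <volname> statistics heal-count   <- pending entries per brick
    gluster volume heal <volname> info                    <- entries currently queued for heal
    gluster volume status <volname>                       <- check the Self-heal Daemon rows

If heal-count stays flat across successive runs, the crawl has most likely stalled rather than just being slow. Toggling cluster.self-heal-daemon off and back on restarts the shd processes if you want to force a fresh crawl.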
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users