OK. Do you also have granular-entry-heal on? Just so that I can isolate the problem area.
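(For anyone following along: the effective value of that option can be checked from any node with `gluster volume get`, which shows the value even when it was never explicitly set. The volume name below is a placeholder.)

```shell
# Check whether granular entry self-heal is enabled.
# "gv0-rep" is a placeholder; substitute your own volume name.
gluster volume get gv0-rep cluster.granular-entry-heal

# Or grep the explicitly reconfigured options:
gluster volume info gv0-rep | grep granular
```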
-Krutika

On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic <bu...@onholyground.com> wrote:
> I noticed that my new brick (replacement disk) did not have a .shard
> directory created on the brick, if that helps.
>
> I removed the affected brick from the volume and then wiped the disk, did
> an add-brick, and everything healed right up. I didn’t try to set any
> attrs or anything else, just removed and added the brick as new.
>
> On Aug 29, 2016, at 9:49 AM, Darrell Budic <bu...@onholyground.com> wrote:
>
> Just to let you know I’m seeing the same issue under 3.7.14 on CentOS 7.
> Some content was healed correctly; now all the shards are queued up in the
> heal list, but nothing is healing. Got brick errors logged similar to the
> ones David was getting, on the brick that isn’t healing:
>
> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
> LOOKUP (null)
> (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
> ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
> LOOKUP (null)
> (00000000-0000-0000-0000-000000000000/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
> ==> (Invalid argument) [Invalid argument]
>
> This was after replacing the drive the brick was on and trying to get it
> back into the system by setting the volume's fattr on the brick dir. I’ll
> try the method suggested here on it shortly.
>
> -Darrell
>
>
> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay <kdhan...@redhat.com> wrote:
>
> Got it. Thanks.
>
> I tried the same test and shd crashed with SIGABRT (well, that's because I
> compiled from src with -DDEBUG).
> In any case, this error would prevent a full heal from proceeding further.
> I'm debugging the crash now. Will let you know when I have the RC.
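(For reference, the remove-and-re-add approach Darrell describes might look roughly like the sketch below for a replica-3 volume. The volume name, brick address, and replica counts are placeholders, not taken from his setup; adjust before use, and note that remove-brick temporarily reduces redundancy.)

```shell
# Hypothetical sketch of removing a failed brick and re-adding it as new.
# "gv0-rep" and the brick path are placeholders.
VOL=gv0-rep
BRICK=host2:/gluster1/brick/1

# Drop the failed brick, reducing the replica count from 3 to 2.
gluster volume remove-brick "$VOL" replica 2 "$BRICK" force

# (Wipe and re-create the brick filesystem/directory here.)

# Add the wiped brick back as a brand-new brick, restoring replica 3.
gluster volume add-brick "$VOL" replica 3 "$BRICK"

# Trigger the heal so the new brick is repopulated from its replicas.
gluster volume heal "$VOL" full
```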
>
> -Krutika
>
> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <dgoss...@carouselchecks.com> wrote:
>>
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <dgoss...@carouselchecks.com> wrote:
>>>
>>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay <kdhan...@redhat.com> wrote:
>>>>
>>>> Could you attach both client and brick logs? Meanwhile I will try these
>>>> steps out on my machines and see if it is easily recreatable.
>>>
>>> Hoping 7z files are accepted by the mail server.
>>
>> Looks like the zip file is awaiting approval due to its size.
>>>
>>>> -Krutika
>>>>
>>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <dgoss...@carouselchecks.com> wrote:
>>>>>
>>>>> CentOS 7, Gluster 3.8.3
>>>>>
>>>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>>>> Options Reconfigured:
>>>>> cluster.data-self-heal-algorithm: full
>>>>> cluster.self-heal-daemon: on
>>>>> cluster.locking-scheme: granular
>>>>> features.shard-block-size: 64MB
>>>>> features.shard: on
>>>>> performance.readdir-ahead: on
>>>>> storage.owner-uid: 36
>>>>> storage.owner-gid: 36
>>>>> performance.quick-read: off
>>>>> performance.read-ahead: off
>>>>> performance.io-cache: off
>>>>> performance.stat-prefetch: on
>>>>> cluster.eager-lock: enable
>>>>> network.remote-dio: enable
>>>>> cluster.quorum-type: auto
>>>>> cluster.server-quorum-type: server
>>>>> server.allow-insecure: on
>>>>> cluster.self-heal-window-size: 1024
>>>>> cluster.background-self-heal-count: 16
>>>>> performance.strict-write-ordering: off
>>>>> nfs.disable: on
>>>>> nfs.addr-namelookup: off
>>>>> nfs.enable-ino32: off
>>>>> cluster.granular-entry-heal: on
>>>>>
>>>>> Friday I did a rolling upgrade from 3.8.3 -> 3.8.3 with no issues.
>>>>> Following the steps detailed in previous recommendations, I began the
>>>>> process of replacing and healing bricks one node at a time.
>>>>>
>>>>> 1) kill pid of brick
>>>>> 2) reconfigure brick from raid6 to raid10
>>>>> 3) recreate directory of brick
>>>>> 4) gluster volume start <> force
>>>>> 5) gluster volume heal <> full
>>>>>
>>>>> The 1st node worked as expected and took 12 hours to heal 1TB of data.
>>>>> Load was a little heavy but nothing shocking.
>>>>>
>>>>> About an hour after node 1 finished, I began the same process on node 2.
>>>>> The heal process kicked in as before, and the files in directories visible
>>>>> from the mount and in .glusterfs healed in a short time. Then it began a
>>>>> crawl of .shard, adding those files to the heal count, at which point the
>>>>> entire process basically ground to a halt. After 48 hours, out of 19k
>>>>> shards it has added only 5900 to the heal list. Load on all 3 machines is
>>>>> negligible. It was suggested to change cluster.data-self-heal-algorithm
>>>>> to full and restart the volume, which I did. No effect. Tried relaunching
>>>>> the heal; no effect, regardless of which node it was started from. I
>>>>> started each VM and performed a stat of all files from within it, or a
>>>>> full virus scan, and that seemed to cause short small spikes in shards
>>>>> added, but not by much. The logs show no real messages indicating anything
>>>>> is going on. I get occasional hits in the brick log for null lookups,
>>>>> making me think it's not really crawling the shards directory but waiting
>>>>> for a shard lookup to add it. I'll get the following in the brick log, but
>>>>> not constantly, and sometimes multiple entries for the same shard.
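(The five steps above can be sketched as shell commands; the volume name and brick path are taken from the volume info earlier in this thread and may differ on other systems. Step 2 happens outside of gluster entirely.)

```shell
# Sketch of the brick-replacement procedure described above.
# Volume name and brick path are from this thread; adjust as needed.
VOL=GLUSTER1
BRICK=/gluster1/BRICK1/1

# 1) Find the brick process for this node (Pid is the last column)
#    and kill it.
gluster volume status "$VOL" | grep "$BRICK"
# kill <pid-of-brick>

# 2) Rebuild the underlying storage (raid6 -> raid10) and re-create
#    the filesystem. This step is outside gluster's scope.

# 3) Recreate the (now empty) brick directory on the new filesystem.
mkdir -p "$BRICK"

# 4) Force-start the volume so the brick process comes back up.
gluster volume start "$VOL" force

# 5) Trigger a full self-heal to repopulate the empty brick.
gluster volume heal "$VOL" full
```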
>>>>>
>>>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no
>>>>> resolution type for (null) (LOOKUP)
>>>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server:
>>>>> 12591783: LOOKUP (null) (00000000-0000-0000-00
>>>>> 00-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==>
>>>>> (Invalid argument) [Invalid argument]
>>>>>
>>>>> This one repeated about 30 times in a row, then nothing for 10 minutes,
>>>>> then a single hit for a different shard.
>>>>>
>>>>> How can I determine if the heal is actually running? How can I kill it or
>>>>> force a restart? Does the node I start it from determine which directory
>>>>> gets crawled to determine heals?
>>>>>
>>>>> *David Gossage*
>>>>> *Carousel Checks Inc. | System Administrator*
>>>>> *Office* 708.613.2284
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users@gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
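(On the question of whether a heal is actually in progress: a few standard commands can help. The volume name below is the one from this thread.)

```shell
# Check what the self-heal daemon is doing on volume GLUSTER1.

# Per-brick list and count of entries still pending heal.
gluster volume heal GLUSTER1 info
gluster volume heal GLUSTER1 statistics heal-count

# Crawl statistics: when the last crawl started/ended and how many
# entries were healed, split-brained, or failed per brick.
gluster volume heal GLUSTER1 statistics

# Toggling the self-heal daemon restarts it, which also effectively
# restarts a stuck crawl; then kick off a fresh full heal.
gluster volume set GLUSTER1 cluster.self-heal-daemon off
gluster volume set GLUSTER1 cluster.self-heal-daemon on
gluster volume heal GLUSTER1 full
```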