Hi Mohammad,

I was unable to reproduce this on a volume created on a system running 3.12.9.
Can you send me the FUSE volfiles for the volume atlasglust? They will be in /var/lib/glusterd/vols/atlasglust/ on any of the gluster servers hosting the volume, named *.tcp-fuse.vol.

Thanks,
Nithya

On 14 June 2018 at 16:42, mohammad kashif <kashif.a...@gmail.com> wrote:
> Hi Nithya
>
> It seems that the problem can be solved either by turning parallel-readdir off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some clients to 3.10.12-1 and that seems to have fixed the problem. Today, when I saw your email, I turned parallel-readdir off and the current 3.12.9-1 client started to work. I upgraded the servers and clients to 3.12.9-1 last month, and since then clients had been intermittently unmounting about once a week. But over the last three days it started unmounting every few minutes. I don't know what triggered this sudden panic, except that the file system was quite full: around 98%. It is a 480 TB file system with almost 80 million files.
>
> The servers have 64 GB RAM and the clients have 64 GB to 192 GB RAM. I tested with a 192 GB RAM client and it still had the same issue.
>
> Volume Name: atlasglust
> Type: Distribute
> Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 7
> Transport-type: tcp
> Bricks:
> Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
> Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
> Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
> Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
> Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
> Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
> Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
> Options Reconfigured:
> diagnostics.client-log-level: ERROR
> diagnostics.brick-log-level: ERROR
> performance.cache-invalidation: on
> server.event-threads: 4
> client.event-threads: 4
> cluster.lookup-optimize: on
> performance.client-io-threads: on
> performance.cache-size: 1GB
> performance.parallel-readdir: off
> performance.md-cache-timeout: 600
> performance.stat-prefetch: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> auth.allow: X.Y.Z.*
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on
>
> Thanks
>
> Kashif
>
> On Thu, Jun 14, 2018 at 5:39 AM, Nithya Balachandran <nbala...@redhat.com> wrote:
>
>> +Poornima, who works on parallel-readdir.
>>
>> @Poornima, have you seen anything like this before?
>>
>> On 14 June 2018 at 10:07, Nithya Balachandran <nbala...@redhat.com> wrote:
>>
>>> This is not the same issue as the one you are referring to - that was in the RPC layer and caused the bricks to crash. This one is different, as it seems to be in the dht and rda layers. It does look like a stack overflow, though.
>>>
>>> @Mohammad,
>>>
>>> Please send the following information:
>>>
>>> 1. gluster volume info
>>> 2. The number of entries in the directory being listed
>>> 3. System memory
>>>
>>> Does this still happen if you turn off parallel-readdir?
>>>
>>> Regards,
>>> Nithya
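For reference, collecting the volfiles Nithya asks for and applying the parallel-readdir workaround Kashif describes would look roughly like this; this is a sketch, and the tarball name is only an example:

    # On any gluster server hosting the volume: bundle the FUSE volfiles
    ls /var/lib/glusterd/vols/atlasglust/*.tcp-fuse.vol
    tar czf atlasglust-fuse-volfiles.tar.gz /var/lib/glusterd/vols/atlasglust/*.tcp-fuse.vol

    # Turn parallel-readdir off, then confirm the value took effect
    gluster volume set atlasglust performance.parallel-readdir off
    gluster volume get atlasglust performance.parallel-readdir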
>>> On 13 June 2018 at 16:40, Milind Changire <mchan...@redhat.com> wrote:
>>>
>>>> +Nithya
>>>>
>>>> Nithya,
>>>> Do these logs [1] look similar to the recursive readdir() issue that you encountered just a while back, i.e. the recursive readdir() response definition in the XDR?
>>>>
>>>> [1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>>>
>>>> On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>
>>>>> Hi Milind
>>>>>
>>>>> Thanks a lot. I managed to run gdb and produced a backtrace as well. It's here:
>>>>>
>>>>> http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>>>>
>>>>> I am trying to understand it but am still not able to make sense of it.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Kashif
>>>>>
>>>>> On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>
>>>>>> Kashif,
>>>>>> FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
>>>>>>
>>>>>> On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Milind
>>>>>>>
>>>>>>> There is no glusterfs-debuginfo available for gluster-3.12 in the http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo. Do you know where I can get it? Also, when I run gdb, it says:
>>>>>>>
>>>>>>> Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64
>>>>>>>
>>>>>>> I can't find a debug package for glusterfs-fuse either.
>>>>>>>
>>>>>>> Thanks from the pit of despair ;)
>>>>>>>
>>>>>>> Kashif
>>>>>>>
>>>>>>> On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Milind
>>>>>>>>
>>>>>>>> I will send you links for the logs.
>>>>>>>>
>>>>>>>> I collected these core dumps on the client, and there is no glusterd process running on the client.
>>>>>>>>
>>>>>>>> Kashif
>>>>>>>>
>>>>>>>> On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Kashif,
>>>>>>>>> Could you also send over the client/mount log file, as Vijay suggested? Or at least the lines around the crash backtrace.
>>>>>>>>>
>>>>>>>>> Also, you mentioned that you straced glusterd, but when you ran gdb, you ran it over /usr/sbin/glusterfs.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbel...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Milind
>>>>>>>>>>>
>>>>>>>>>>> The operating system is Scientific Linux 6, which is based on RHEL 6. The CPU arch is Intel x86_64.
>>>>>>>>>>>
>>>>>>>>>>> I will send you a separate email with a link to the core dump.
>>>>>>>>>>
>>>>>>>>>> You could also grep for "crash" in the client log file; the lines following it will have a backtrace in most cases.
>>>>>>>>>>
>>>>>>>>>> HTH,
>>>>>>>>>> Vijay
>>>>>>>>>>
>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>
>>>>>>>>>>> Kashif
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Kashif,
>>>>>>>>>>>> Could you share the core dump via Google Drive or something similar?
>>>>>>>>>>>>
>>>>>>>>>>>> Also, let me know the CPU arch and OS distribution on which you are running gluster.
>>>>>>>>>>>>
>>>>>>>>>>>> If you've installed the glusterfs-debuginfo package, you'll also get the source lines in the backtrace via gdb.
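As an illustration of the debuginfo route discussed above: the exact RPM file name below is an assumption and should be checked against the debuginfo.centos.org listing; the core file name is the one from Kashif's message further down.

    # Hypothetical RPM name - check http://debuginfo.centos.org/centos/6/storage/x86_64/
    # for the file matching the installed glusterfs-3.12.9-1.el6 packages
    rpm -ivh glusterfs-debuginfo-3.12.9-1.el6.x86_64.rpm

    # Re-run gdb against the core and capture a full backtrace of every thread
    gdb /usr/sbin/glusterfs core.138536 \
        -ex 'set pagination off' \
        -ex 'thread apply all bt full' \
        -ex quit > backtrace.log 2>&1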
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Milind, Vijay
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I have some more information now, as I straced glusterd on the client:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 138544 0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
>>>>>>>>>>>>> 138544 0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>>>> 138544 0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>>>> 138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
>>>>>>>>>>>>> 138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
>>>>>>>>>>>>> 138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>> 138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>> 138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>> 138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>> 138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>> 138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>> 138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>>
>>>>>>>>>>>>> As far as I understand, gluster is somehow trying to access memory in an inappropriate manner and the kernel sends SIGSEGV.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also got the core dump. I am trying gdb for the first time, so I am not sure whether I am using it correctly:
>>>>>>>>>>>>>
>>>>>>>>>>>>> gdb /usr/sbin/glusterfs core.138536
>>>>>>>>>>>>>
>>>>>>>>>>>>> It just tells me that the program terminated with signal 11, segmentation fault.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is not limited to one client but is happening on many clients.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would really appreciate any help, as the whole file system has become unusable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kashif
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kashif,
>>>>>>>>>>>>>> You can change the log level by:
>>>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.brick-log-level TRACE
>>>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.client-log-level TRACE
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and see how things fare.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you want fewer logs, you can change the log level to DEBUG instead of TRACE.
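A small sketch of making sure cores keep getting captured and of identifying the crashing binary before opening gdb; the core-pattern path is only an example, and on distributions where abrt manages cores it may override this setting:

    # Allow core dumps in the shell that (re)starts the mount process
    ulimit -c unlimited
    echo '/tmp/core.%p' > /proc/sys/kernel/core_pattern   # needs root; example path

    # Confirm which executable produced a given core before loading it in gdb
    file core.138536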
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Vijay
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now it is unmounting every 30 minutes!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log has only these lines:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
>>>>>>>>>>>>>>> [2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is no other information. Is there any way to increase log verbosity?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On the client:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14
>>>>>>>>>>>>>>> [2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there a problem with the server and client lk-version?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kashif
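To pull more than these handshake messages out of the client log, one can grep around the crash handler's output, as Vijay suggests earlier in the thread. The log file name below is a guess derived from a hypothetical mount point; adjust it to the actual one:

    # Client log names are derived from the mount point, e.g. /mnt/atlas -> mnt-atlas.log
    grep -n -A 20 'crash' /var/log/glusterfs/mnt-atlas.log

    # The frames printed after "signal received: 11" are the client-side backtrace
    grep -n -B 2 -A 30 'signal received' /var/log/glusterfs/mnt-atlas.log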
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <vbel...@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since I updated our gluster servers and clients to the latest version, 3.12.9-1, I have been having this issue of gluster getting unmounted from clients very regularly. It was not a problem before the update.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's a distributed file system with no replication. We have seven servers totalling around 480 TB of data. It's 97% full.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am using the following config on the server:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation on
>>>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation-timeout 600
>>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-invalidation on
>>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.md-cache-timeout 600
>>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.parallel-readdir on
>>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-size 1GB
>>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.client-io-threads on
>>>>>>>>>>>>>>>>> gluster volume set atlasglust cluster.lookup-optimize on
>>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>>>>> gluster volume set atlasglust client.event-threads 4
>>>>>>>>>>>>>>>>> gluster volume set atlasglust server.event-threads 4
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The clients are mounted with these options:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can't see anything in the log file. Can someone suggest how to troubleshoot this issue?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you please share the log file? Checking for messages related to disconnections/crashes in the log file would be a good way to start troubleshooting the problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Vijay
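For completeness, those client options correspond to an fstab entry along these lines; the server name is taken from the brick list above and the mount point is a placeholder:

    # /etc/fstab entry (mount point is a placeholder)
    pplxgluster01.X.Y.Z:/atlasglust  /mnt/atlasglust  glusterfs  defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev  0 0

    # Equivalent one-off mount for testing
    mount -t glusterfs -o direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache pplxgluster01.X.Y.Z:/atlasglust /mnt/atlasglust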
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users