Hi Nithya

It seems the problem can be solved either by turning parallel-readdir off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some clients to 3.10.12-1 and that seemed to fix the problem. Today, after seeing your email, I turned parallel-readdir off and the current 3.12.9-1 client started to work.

I upgraded the servers and clients to 3.12.9-1 last month, and since then clients had been intermittently unmounting about once a week. But during the last three days it started unmounting every few minutes. I don't know what triggered this sudden panic, except that the file system was quite full: around 98%. It is a 480 TB file system with almost 80 million files.
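For anyone else hitting this, the two workarounds described above can be sketched as shell commands. The exact yum package set for the downgrade is an assumption for a typical EL6 FUSE client; adjust it to what is actually installed:

```shell
# Workaround 1: turn parallel-readdir off for the volume (run on any server)
gluster volume set atlasglust performance.parallel-readdir off

# Workaround 2: pin the client back to 3.10.12-1 (run on each affected client,
# after unmounting the volume; package names are an assumption)
yum downgrade glusterfs-3.10.12-1 glusterfs-libs-3.10.12-1 \
    glusterfs-fuse-3.10.12-1 glusterfs-client-xlators-3.10.12-1
```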
Servers have 64 GB RAM and clients have 64 GB to 192 GB RAM. I tested with a 192 GB RAM client and it still had the same issue.

Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Bricks:
Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
Options Reconfigured:
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Thanks

Kashif

On Thu, Jun 14, 2018 at 5:39 AM, Nithya Balachandran <nbala...@redhat.com> wrote:

> +Poornima, who works on parallel-readdir.
>
> @Poornima, have you seen anything like this before?
>
> On 14 June 2018 at 10:07, Nithya Balachandran <nbala...@redhat.com> wrote:
>
>> This is not the same issue as the one you are referring to - that one was in the RPC layer and caused the bricks to crash. This one is different, as it seems to be in the dht and rda layers. It does look like a stack overflow, though.
>>
>> @Mohammad,
>>
>> Please send the following information:
>>
>> 1. gluster volume info
>> 2. The number of entries in the directory being listed
>> 3. System memory
>>
>> Does this still happen if you turn off parallel-readdir?
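The three data points requested above can be gathered with standard commands; a minimal sketch, where the directory path is a placeholder for whichever directory triggers the crash:

```shell
# 1. Volume configuration (run on any gluster server):
#      gluster volume info atlasglust

# 2. Number of entries in the directory being listed (run on the client;
#    the path below is a placeholder)
dir=/mnt/atlasglust/some/dir
ls -1A "$dir" | wc -l

# 3. System memory on the client
free -h
```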
>>
>> Regards,
>> Nithya
>>
>> On 13 June 2018 at 16:40, Milind Changire <mchan...@redhat.com> wrote:
>>
>>> +Nithya
>>>
>>> Nithya,
>>> Do these logs [1] look similar to the recursive readdir() issue that you encountered just a while back? i.e. the recursive readdir() response definition in the XDR.
>>>
>>> [1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>>
>>> On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>
>>>> Hi Milind
>>>>
>>>> Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
>>>>
>>>> http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>>>
>>>> I am trying to understand it but am still not able to make sense of it.
>>>>
>>>> Thanks
>>>>
>>>> Kashif
>>>>
>>>> On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchan...@redhat.com> wrote:
>>>>
>>>>> Kashif,
>>>>> FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
>>>>>
>>>>> On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>
>>>>>> Hi Milind
>>>>>>
>>>>>> There is no glusterfs-debuginfo available for gluster-3.12 from the http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo. Do you know where I can get it?
>>>>>>
>>>>>> Also, when I run gdb, it says:
>>>>>>
>>>>>> Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64
>>>>>>
>>>>>> I can't find a debug package for glusterfs-fuse either.
>>>>>>
>>>>>> Thanks from the pit of despair ;)
>>>>>>
>>>>>> Kashif
>>>>>>
>>>>>> On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Milind
>>>>>>>
>>>>>>> I will send you links to the logs.
>>>>>>>
>>>>>>> I collected these core dumps on a client, and there is no glusterd process running on the client.
>>>>>>>
>>>>>>> Kashif
>>>>>>>
>>>>>>> On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>
>>>>>>>> Kashif,
>>>>>>>> Could you also send over the client/mount log file, as Vijay suggested? Or at least the lines around the crash backtrace.
>>>>>>>>
>>>>>>>> Also, you've mentioned that you straced glusterd, but when you ran gdb, you ran it over /usr/sbin/glusterfs.
>>>>>>>>
>>>>>>>> On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbel...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Milind
>>>>>>>>>>
>>>>>>>>>> The operating system is Scientific Linux 6, which is based on RHEL 6. The CPU arch is Intel x86_64.
>>>>>>>>>>
>>>>>>>>>> I will send you a separate email with a link to the core dump.
>>>>>>>>>
>>>>>>>>> You could also grep for "crash" in the client log file; the lines following "crash" will contain a backtrace in most cases.
>>>>>>>>>
>>>>>>>>> HTH,
>>>>>>>>> Vijay
>>>>>>>>>
>>>>>>>>>> Thanks for your help.
>>>>>>>>>>
>>>>>>>>>> Kashif
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Kashif,
>>>>>>>>>>> Could you share the core dump via Google Drive or something similar?
>>>>>>>>>>>
>>>>>>>>>>> Also, let me know the CPU arch and OS distribution on which you are running gluster.
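Vijay's grep suggestion above can be sketched as a one-liner. The client log path convention (a file under /var/log/glusterfs/ named after the mount point) and the sample log content below are illustrative assumptions, not output from this cluster:

```shell
# On a real client the file would be something like
# /var/log/glusterfs/mnt-atlasglust.log (named after the mount point);
# a tiny sample file stands in for it here.
log=$(mktemp)
printf '%s\n' 'some normal line' 'crash' 'frame : 0' 'frame : 1' > "$log"

# Print the crash marker plus the lines that follow it (the backtrace);
# on a real log a larger context, e.g. -A 30, captures the full trace.
grep -A 2 'crash' "$log"
```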
>>>>>>>>>>>
>>>>>>>>>>> If you've installed the glusterfs-debuginfo package, you'll also get the source lines in the backtrace via gdb.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Milind, Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I have some more information now, as I straced glusterd on a client:
>>>>>>>>>>>>
>>>>>>>>>>>> 138544 0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
>>>>>>>>>>>> 138544 0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>>> 138544 0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>>> 138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
>>>>>>>>>>>> 138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
>>>>>>>>>>>> 138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>
>>>>>>>>>>>> As far as I understand, gluster is somehow accessing memory in an inappropriate manner and the kernel sends SIGSEGV.
>>>>>>>>>>>>
>>>>>>>>>>>> I also got the core dump. I am trying gdb for the first time, so I am not sure whether I am using it correctly:
>>>>>>>>>>>>
>>>>>>>>>>>> gdb /usr/sbin/glusterfs core.138536
>>>>>>>>>>>>
>>>>>>>>>>>> It just tells me that the program terminated with signal 11, segmentation fault.
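To get past the bare "signal 11" summary, gdb needs to be told to print a backtrace. A sketch, reusing the core file name from the thread (needs the glusterfs-debuginfo package installed for symbol names):

```shell
# Dump the crashing thread's backtrace plus every thread's backtrace
# to a file, non-interactively (batch mode).
gdb -batch \
    -ex 'bt' \
    -ex 'thread apply all bt' \
    /usr/sbin/glusterfs core.138536 > backtrace.log 2>&1
```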
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is not limited to one client but is happening on many clients.
>>>>>>>>>>>>
>>>>>>>>>>>> I will really appreciate any help, as the whole file system has become unusable.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> Kashif
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Kashif,
>>>>>>>>>>>>> You can change the log level with:
>>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.brick-log-level TRACE
>>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.client-log-level TRACE
>>>>>>>>>>>>>
>>>>>>>>>>>>> and see how things fare.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you want fewer logs, you can change the log level to DEBUG instead of TRACE.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vijay
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now it is unmounting every 30 mins!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log has only these lines:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
>>>>>>>>>>>>>> [2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is no other information. Is there any way to increase log verbosity?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the client:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>>> [2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.
>>>>>>>>>>>>>> [2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>>> [2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1
>>>>>>>>>>>>>> [2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>>> [2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.
>>>>>>>>>>>>>> [2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>>> [2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1
>>>>>>>>>>>>>> [2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14
>>>>>>>>>>>>>> [2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a problem with the server and client lk-version?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kashif
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <vbel...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since I updated our gluster servers and clients to the latest version, 3.12.9-1, I have been having this issue of gluster getting unmounted from clients very regularly. It was not a problem before the update.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a distributed file system with no replication. We have seven servers totaling around 480 TB of data. It's 97% full.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am using the following config on the server:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation on
>>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation-timeout 600
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-invalidation on
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.md-cache-timeout 600
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.parallel-readdir on
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-size 1GB
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.client-io-threads on
>>>>>>>>>>>>>>>> gluster volume set atlasglust cluster.lookup-optimize on
>>>>>>>>>>>>>>>> gluster volume set atlasglust client.event-threads 4
>>>>>>>>>>>>>>>> gluster volume set atlasglust server.event-threads 4
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Clients are mounted with these options:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can't see anything in the log file. Can someone suggest how to troubleshoot this issue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please share the log file? Checking for messages related to disconnections/crashes in the log file would be a good way to
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Vijay >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>> Gluster-users@gluster.org >>>>>>>>>>>>>> http://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Milind >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Milind >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Milind >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Milind >>>>> >>>>> >>>> >>> >>> >>> -- >>> Milind >>> >>> >> >
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users