On Mon, Jun 18, 2018 at 8:11 AM, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
> From the bt:
>
> #8  0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #9  0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
> #10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
> #12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
> #14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
> #16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
> #18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
> #20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
> #22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
>
> It looks like an infinite recursion. Note that readdirp is wound to the same subvol (the value of "this" is the same in all calls to rda_readdirp) at the same offset (of value 2). This may be a bug in DHT (winding down readdirp with a wrong offset) or in readdir-ahead (populating incorrect offset values in the dentries it returns in the readdirp response).
It looks like memory corruption: the value of the size argument to rda_readdirp (140114606084288, around 127 TB) is far too big to be sane. If you have a reproducer, please run it under valgrind or ASAN. To make it explicit: at the moment it is not clear whether the bug is in readdir-ahead or in DHT, as this looks like memory corruption. Until I get a reproducer, or valgrind/ASAN output from the client process when the issue occurs, I won't be working on this problem.
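For a FUSE client that would look roughly like the following; the server name and mount point are placeholders for your setup, and note that ASAN needs a build compiled with -fsanitize=address, so valgrind is usually the easier option with stock packages:

  umount /mnt/atlasglust
  valgrind --tool=memcheck --track-origins=yes --log-file=/tmp/glusterfs-valgrind.%p.log \
      /usr/sbin/glusterfs -N --volfile-server=<your-server> --volfile-id=atlasglust /mnt/atlasglust

The -N keeps the client process in the foreground so valgrind can follow it; the valgrind log should then show the first invalid read/write rather than just the eventual crash site.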
> On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>
>> Hi Milind
>>
>> Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
>>
>> http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>
>> I am trying to understand it but am still not able to make sense out of it.
>>
>> Thanks
>>
>> Kashif
>>
>> On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchan...@redhat.com> wrote:
>>
>>> Kashif,
>>> FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
>>>
>>> On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>
>>>> Hi Milind
>>>>
>>>> There is no glusterfs-debuginfo available for gluster-3.12 from the
>>>> http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
>>>> Do you know where I can get it?
>>>> Also, when I run gdb, it says:
>>>>
>>>> Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64
>>>>
>>>> I can't find a debug package for glusterfs-fuse either.
>>>>
>>>> Thanks from the pit of despair ;)
>>>>
>>>> Kashif
>>>>
>>>> On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>
>>>>> Hi Milind
>>>>>
>>>>> I will send you links for the logs.
>>>>>
>>>>> I collected these core dumps on the client, and there is no glusterd process running on the client.
>>>>>
>>>>> Kashif
>>>>>
>>>>> On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>
>>>>>> Kashif,
>>>>>> Could you also send over the client/mount log file as Vijay suggested?
>>>>>> Or maybe just the lines with the crash backtrace.
>>>>>>
>>>>>> Also, you've mentioned that you straced glusterd, but when you ran gdb, you ran it over /usr/sbin/glusterfs.
>>>>>>
>>>>>> On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbel...@redhat.com> wrote:
>>>>>>
>>>>>>> On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Milind
>>>>>>>>
>>>>>>>> The operating system is Scientific Linux 6, which is based on RHEL 6. The CPU arch is Intel x86_64.
>>>>>>>>
>>>>>>>> I will send you a separate email with a link to the core dump.
>>>>>>>
>>>>>>> You could also grep for "crash" in the client log file; the lines following a crash have a backtrace in most cases.
>>>>>>>
>>>>>>> HTH,
>>>>>>> Vijay
>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>>
>>>>>>>> Kashif
>>>>>>>>
>>>>>>>> On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Kashif,
>>>>>>>>> Could you share the core dump via Google Drive or something similar?
>>>>>>>>>
>>>>>>>>> Also, let me know the CPU arch and OS distribution on which you are running gluster.
>>>>>>>>>
>>>>>>>>> If you've installed the glusterfs-debuginfo package, you'll also get the source lines in the backtrace via gdb.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Milind, Vijay
>>>>>>>>>>
>>>>>>>>>> Thanks. I have some more information now, as I straced glusterd on the client:
>>>>>>>>>>
>>>>>>>>>> 138544  0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
>>>>>>>>>> 138544  0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>> 138544  0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>> 138544  0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
>>>>>>>>>> 138544  0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
>>>>>>>>>> 138551  0.105048 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138550  0.000041 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138547  0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138546  0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138545  0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138544  0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138543  0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>
>>>>>>>>>> As far as I understand, gluster is somehow trying to access memory in an inappropriate manner and the kernel sends SIGSEGV.
>>>>>>>>>>
>>>>>>>>>> I also got the core dump. I am trying gdb for the first time, so I am not sure whether I am using it correctly:
>>>>>>>>>>
>>>>>>>>>> gdb /usr/sbin/glusterfs core.138536
>>>>>>>>>>
>>>>>>>>>> It just tells me that the program terminated with signal 11, segmentation fault.
>>>>>>>>>>
>>>>>>>>>> The problem is not limited to one client but is happening on many clients.
>>>>>>>>>>
>>>>>>>>>> I will really appreciate any help, as the whole file system has become unusable.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Kashif
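For reference: once the matching glusterfs-debuginfo package is installed, a gdb session along these lines gets a readable backtrace out of that core (the core file name here is taken from the mail above):

  gdb /usr/sbin/glusterfs core.138536
  (gdb) bt                    # backtrace of the crashing thread
  (gdb) bt full               # same, plus local variables
  (gdb) thread apply all bt   # backtraces of every thread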
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Kashif,
>>>>>>>>>>> You can change the log level with:
>>>>>>>>>>>
>>>>>>>>>>> $ gluster volume set <vol> diagnostics.brick-log-level TRACE
>>>>>>>>>>> $ gluster volume set <vol> diagnostics.client-log-level TRACE
>>>>>>>>>>>
>>>>>>>>>>> and see how things fare.
>>>>>>>>>>>
>>>>>>>>>>> If you want fewer logs you can change the log level to DEBUG instead of TRACE.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> Now it is unmounting every 30 mins!
>>>>>>>>>>>>
>>>>>>>>>>>> The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log has only these lines:
>>>>>>>>>>>>
>>>>>>>>>>>> [2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
>>>>>>>>>>>> [2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
>>>>>>>>>>>>
>>>>>>>>>>>> There is no other information. Is there any way to increase log verbosity?
>>>>>>>>>>>>
>>>>>>>>>>>> On the client:
>>>>>>>>>>>>
>>>>>>>>>>>> [2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>> [2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.
>>>>>>>>>>>> [2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>> [2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1
>>>>>>>>>>>> [2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>> [2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.
>>>>>>>>>>>> [2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>> [2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1
>>>>>>>>>>>> [2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14
>>>>>>>>>>>> [2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a problem with the server and client lk-version?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>
>>>>>>>>>>>> Kashif
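Two asides on the above. As far as I can tell, the lk-version messages are Info-level handshake notices that appear whenever a client (re)connects and reopens its fds, so they are expected and not a problem in themselves. And besides the diagnostics.client-log-level volume option, the FUSE mount script accepts a log level directly at mount time; a sketch, with placeholder server name and mount point:

  mount -t glusterfs -o log-level=DEBUG <your-server>:/atlasglust /your/mount/point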
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <vbel...@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <kashif.a...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since I updated our gluster servers and clients to the latest version, 3.12.9-1, I have been having this issue of gluster getting unmounted from clients very regularly. It was not a problem before the update.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a distributed file system with no replication. We have seven servers totaling around 480 TB of data. It's 97% full.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am using the following config on the servers:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation on
>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation-timeout 600
>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-invalidation on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.md-cache-timeout 600
>>>>>>>>>>>>>> gluster volume set atlasglust performance.parallel-readdir on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-size 1GB
>>>>>>>>>>>>>> gluster volume set atlasglust performance.client-io-threads on
>>>>>>>>>>>>>> gluster volume set atlasglust cluster.lookup-optimize on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>> gluster volume set atlasglust client.event-threads 4
>>>>>>>>>>>>>> gluster volume set atlasglust server.event-threads 4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Clients are mounted with these options:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can't see anything in the log file. Can someone suggest how to troubleshoot this issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you please share the log file? Checking for messages related to disconnections/crashes in the log file would be a good way to start troubleshooting the problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Vijay
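To make Vijay's suggestion concrete: for a FUSE mount the client log lives under /var/log/glusterfs/ and is named after the mount point with '/' replaced by '-', so the file name below is a guess to be adjusted to your setup:

  grep -A20 -iE 'crash|disconnect' /var/log/glusterfs/mnt-atlasglust.log

The lines following a crash marker normally include the signal and a frame list like the one at the top of this thread.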
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users