Re: [Gluster-devel] [RFC] Reducing maintenance burden and moving fuse support to an external project
Hi. Keep me CCed, please, because for the last couple of months I have not been following GlusterFS development… On Friday, 3 March 2017 at 21:50:07 CET Niels de Vos wrote: > At the moment we have three top-level interfaces to maintain in Gluster, > these are FUSE, Gluster/NFS and gfapi. If any work is needed to support > new options, FOPs or other functionalities, we mostly have to do the > work 3x. Often one of the interfaces gets forgotten, or does not need > the new feature immediately (backlog++). This is bothering me every now > and then, especially when bugs get introduced and need to get fixed in > different ways for these three interfaces. > > One of my main goals is to reduce the code duplication, and move > everything to gfapi. We are on a good way to use NFS-Ganesha instead of > Gluster/NFS already. In a similar approach, I would love to see > deprecating our xlators/mount sources[0], and have it replaced by > xglfs[1] from Oleksandr. > > Having the FUSE mount binaries provided by a separate project should > make it easier to implement things like subdirectory mounts (Samba and > NFS-Ganesha already do this in some form through gfapi). > > xglfs is not packaged in any distribution yet, this allows us to change > the current commandline interface to something we deem more suitable (if > so). > > I would like to get some opinions from others, and if there are no > absolute objections, we can work out a plan to make xglfs an alternative > to the fuse-bridge and eventually replace it. > > Thanks, > Niels > > > 0. https://github.com/gluster/glusterfs/tree/master/xlators/mount > 1. https://github.com/gluster/xglfs ___ Gluster-devel mailing list Gluster-devel@gluster.org http://lists.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Memory management and friends
Hello. As a result of today's community meeting, I'm starting a dedicated ML thread for gathering memory management issues together, to make it possible to summarize them and construct a plan for what to do next. Very important notice: I'm not an active GlusterFS developer, but I gained extensive experience with GlusterFS at my previous work, and the main issue that haunted me the whole time was memory leaks. Consider this a request for action from a GlusterFS customer, apparently approved by Kaushal and Amye during the last meeting :). So, here go the key points. 1) Almost all nasty and obvious memory leaks have been successfully fixed during the last year, and that allowed me to run GlusterFS in production at my previous work for almost all types of workload except one — Dovecot mail storage. The specific of this workload is that it involves a huge number of files, and I assume it to be a kind of edge case that uncovers some dark corners of GlusterFS memory management. I was able to provide Nithya with Valgrind+Massif memory profiling results and a test case, and that helped her prepare at least one extra fix (and more to come, AFAIK) that deals with readdirp-related code. Nevertheless, it is reported that this is not the major source of leaks. Nithya suspects that memory gets fragmented heavily due to lots of small allocations, and memory pools cannot cope with this kind of fragmentation under constant load. Related BZs: * https://bugzilla.redhat.com/show_bug.cgi?id=1369364 * https://bugzilla.redhat.com/show_bug.cgi?id=1380249 People involved: * nbalacha, could you please provide more info on your findings? 2) Meanwhile, Jeff goes on with the brick multiplexing feature, facing some issues with memory management too and blaming memory pools for them. Related ML emails: * http://www.gluster.org/pipermail/gluster-devel/2016-October/051118.html * http://www.gluster.org/pipermail/gluster-devel/2016-October/051160.html People involved: * jdarcy, have you discussed this outside of the ML? It seems your email didn't get proper attention. 3) We had a brief discussion with obnox and anoopcs on #gluster-meeting and #gluster-dev regarding jemalloc and talloc. obnox believes that we may use both: jemalloc for substituting malloc/free, and talloc for rewriting GlusterFS memory management properly. Related logs: * https://botbot.me/freenode/gluster-dev/2016-10-26/?msg=75501394&page=2 People involved: * obnox, could you share your ideas on this? To summarize: 1) we need the key devs involved in memory management to share their ideas; 2) using production-proven memory allocators and memory pool implementations is desired; 3) someone should manage the workflow of reconstructing memory management. Feel free to add anything I've missed. Regards, Oleksandr ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
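A note on the talloc suggestion in point 3 above: the core property obnox refers to is hierarchical allocation, where freeing a parent context releases every allocation hung off it, which makes teardown and leak hunting much simpler; jemalloc, by contrast, can usually be tried without code changes, e.g. by preloading it. Below is a minimal, illustrative talloc sketch — it is not GlusterFS code and only assumes libtalloc and its headers are installed (link with -ltalloc):
===
/* Illustrative only -- not GlusterFS code. Shows the hierarchical
 * allocation property of talloc: freeing a parent context frees every
 * child allocation. Build with `gcc talloc-demo.c -ltalloc`. */
#include <stdio.h>
#include <talloc.h>

struct fake_inode {
        char *name;
        char *path;
};

int main(void)
{
        /* A top-level context, e.g. one per xlator or per connection. */
        TALLOC_CTX *xl_ctx = talloc_new(NULL);

        /* Children are attached to their parent context... */
        struct fake_inode *ino = talloc_zero(xl_ctx, struct fake_inode);
        ino->name = talloc_strdup(ino, "file.txt");
        ino->path = talloc_asprintf(ino, "/dir/%s", ino->name);

        printf("%s\n", ino->path);

        /* ...so one free of the parent releases the whole tree, which is
         * what makes teardown and leak hunting simpler. */
        talloc_free(xl_ctx);
        return 0;
}
===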
[Gluster-devel] Spurious failure of ./tests/bugs/glusterd/bug-913555.t
Hello. Vijay asked me to drop a note about the spurious failure of the ./tests/bugs/glusterd/bug-913555.t test. Here are some examples: * https://build.gluster.org/job/centos6-regression/1069/consoleFull * https://build.gluster.org/job/centos6-regression/1076/consoleFull Could someone take a look at it? Also, the last two builds were broken because of this: === Slave went offline during the build === See these builds for details: * https://build.gluster.org/job/centos6-regression/1077/consoleFull * https://build.gluster.org/job/centos6-regression/1078/consoleFull Was that intentional? Thanks. Regards, Oleksandr ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Libunwind
08.09.2016 16:07, Jeff Darcy wrote: (1) Has somebody already gone down this path? Does it work? We've switched most of our internal projects to libunwind. It works OK. (2) Are there any other reasons we wouldn't want to switch? No, just go and switch :). ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Profiling GlusterFS FUSE client with Valgrind's Massif tool
Created BZ for it [1]. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1373630 On вівторок, 6 вересня 2016 р. 23:32:51 EEST Pranith Kumar Karampuri wrote: > I included you on a thread on users, let us see if he can help you out. > > On Mon, Aug 29, 2016 at 4:02 PM, Oleksandr Natalenko < > > oleksa...@natalenko.name> wrote: > > More info here. > > > > Massif puts the following warning on volume unmount: > > > > === > > valgrind: m_mallocfree.c:304 (get_bszB_as_is): Assertion 'bszB_lo == > > bszB_hi' failed. > > valgrind: Heap block lo/hi size mismatch: lo = 1, hi = 0. > > This is probably caused by your program erroneously writing past the > > end of a heap block and corrupting heap metadata. If you fix any > > invalid writes reported by Memcheck, this assertion failure will > > probably go away. Please try that before reporting this as a bug. > > ... > > Thread 1: status = VgTs_Runnable > > ==30590==at 0x4C29037: free (in /usr/lib64/valgrind/vgpreload_ > > massif-amd64-linux.so) > > ==30590==by 0x67CE63B: __libc_freeres (in /usr/lib64/libc-2.17.so) > > ==30590==by 0x4A246B4: _vgnU_freeres (in > > /usr/lib64/valgrind/vgpreload_ > > core-amd64-linux.so) > > ==30590==by 0x66A2E2A: __run_exit_handlers (in /usr/lib64/libc-2.17.so > > ) > > ==30590==by 0x66A2EB4: exit (in /usr/lib64/libc-2.17.so) > > ==30590==by 0x1117E9: cleanup_and_exit (glusterfsd.c:1308) > > ==30590==by 0x669F66F: ??? (in /usr/lib64/libc-2.17.so) > > ==30590==by 0x606EEF4: pthread_join (in /usr/lib64/libpthread-2.17.so) > > ==30590==by 0x4EC2687: event_dispatch_epoll (event-epoll.c:762) > > ==30590==by 0x10E876: main (glusterfsd.c:2370) > > ... > > === > > > > I rechecked mount/ls/unmount with memcheck tool as suggested and got the > > following: > > > > === > > ... > > ==30315== Thread 8: > > ==30315== Syscall param writev(vector[...]) points to uninitialised > > byte(s) > > ==30315==at 0x675FEA0: writev (in /usr/lib64/libc-2.17.so) > > ==30315==by 0xE664795: send_fuse_iov (fuse-bridge.c:158) > > ==30315==by 0xE6649B9: send_fuse_data (fuse-bridge.c:197) > > ==30315==by 0xE666F7A: fuse_attr_cbk (fuse-bridge.c:753) > > ==30315==by 0xE6671A6: fuse_root_lookup_cbk (fuse-bridge.c:783) > > ==30315==by 0x14519937: io_stats_lookup_cbk (io-stats.c:1512) > > ==30315==by 0x14300B3E: mdc_lookup_cbk (md-cache.c:867) > > ==30315==by 0x13EE9226: qr_lookup_cbk (quick-read.c:446) > > ==30315==by 0x13CD8B66: ioc_lookup_cbk (io-cache.c:260) > > ==30315==by 0x1346405D: dht_revalidate_cbk (dht-common.c:985) > > ==30315==by 0x1320EC60: afr_discover_done (afr-common.c:2316) > > ==30315==by 0x1320EC60: afr_discover_cbk (afr-common.c:2361) > > ==30315==by 0x12F9EE91: client3_3_lookup_cbk (client-rpc-fops.c:2981) > > ==30315== Address 0x170b238c is on thread 8's stack > > ==30315== in frame #3, created by fuse_attr_cbk (fuse-bridge.c:723) > > ... > > ==30315== Warning: invalid file descriptor -1 in syscall close() > > ==30315== Thread 1: > > ==30315== Invalid free() / delete / delete[] / realloc() > > ==30315==at 0x4C2AD17: free (in /usr/lib64/valgrind/vgpreload_ > > memcheck-amd64-linux.so) > > ==30315==by 0x67D663B: __libc_freeres (in /usr/lib64/libc-2.17.so) > > ==30315==by 0x4A246B4: _vgnU_freeres (in > > /usr/lib64/valgrind/vgpreload_ > > core-amd64-linux.so) > > ==30315==by 0x66AAE2A: __run_exit_handlers (in /usr/lib64/libc-2.17.so > > ) > > ==30315==by 0x66AAEB4: exit (in /usr/lib64/libc-2.17.so) > > ==30315==by 0x1117E9: cleanup_and_exit (glusterfsd.c:1308) > > ==30315==by 0x66A766F: ??? 
(in /usr/lib64/libc-2.17.so) > > ==30315== by 0x6076EF4: pthread_join (in /usr/lib64/libpthread-2.17.so) > > ==30315==by 0x4ECA687: event_dispatch_epoll (event-epoll.c:762) > > ==30315==by 0x10E876: main (glusterfsd.c:2370) > > ==30315== Address 0x6a2d3d0 is 0 bytes inside data symbol > > "noai6ai_cached" > > === > > > > It seems Massif crashes (?) because of invalid memory access in glusterfs > > process cleanup stage. > > > > Pranith? Nithya? > > > > 29.08.2016 13:14, Oleksandr Natalenko wrote: > >> === > >> valgrind --tool=massif --trace-children=yes /usr/sbin/glusterfs -N > >> --volfile-server=server.example.com --volfile-id=test > >> /mnt/net/glusterfs/test > >> === > > > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Profiling GlusterFS FUSE client with Valgrind's Massif tool
More info here. Massif puts the following warning on volume unmount: === valgrind: m_mallocfree.c:304 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed. valgrind: Heap block lo/hi size mismatch: lo = 1, hi = 0. This is probably caused by your program erroneously writing past the end of a heap block and corrupting heap metadata. If you fix any invalid writes reported by Memcheck, this assertion failure will probably go away. Please try that before reporting this as a bug. ... Thread 1: status = VgTs_Runnable ==30590==at 0x4C29037: free (in /usr/lib64/valgrind/vgpreload_massif-amd64-linux.so) ==30590==by 0x67CE63B: __libc_freeres (in /usr/lib64/libc-2.17.so) ==30590==by 0x4A246B4: _vgnU_freeres (in /usr/lib64/valgrind/vgpreload_core-amd64-linux.so) ==30590==by 0x66A2E2A: __run_exit_handlers (in /usr/lib64/libc-2.17.so) ==30590==by 0x66A2EB4: exit (in /usr/lib64/libc-2.17.so) ==30590==by 0x1117E9: cleanup_and_exit (glusterfsd.c:1308) ==30590==by 0x669F66F: ??? (in /usr/lib64/libc-2.17.so) ==30590==by 0x606EEF4: pthread_join (in /usr/lib64/libpthread-2.17.so) ==30590==by 0x4EC2687: event_dispatch_epoll (event-epoll.c:762) ==30590==by 0x10E876: main (glusterfsd.c:2370) ... === I rechecked mount/ls/unmount with memcheck tool as suggested and got the following: === ... ==30315== Thread 8: ==30315== Syscall param writev(vector[...]) points to uninitialised byte(s) ==30315==at 0x675FEA0: writev (in /usr/lib64/libc-2.17.so) ==30315==by 0xE664795: send_fuse_iov (fuse-bridge.c:158) ==30315==by 0xE6649B9: send_fuse_data (fuse-bridge.c:197) ==30315==by 0xE666F7A: fuse_attr_cbk (fuse-bridge.c:753) ==30315==by 0xE6671A6: fuse_root_lookup_cbk (fuse-bridge.c:783) ==30315==by 0x14519937: io_stats_lookup_cbk (io-stats.c:1512) ==30315==by 0x14300B3E: mdc_lookup_cbk (md-cache.c:867) ==30315==by 0x13EE9226: qr_lookup_cbk (quick-read.c:446) ==30315==by 0x13CD8B66: ioc_lookup_cbk (io-cache.c:260) ==30315==by 0x1346405D: dht_revalidate_cbk (dht-common.c:985) ==30315==by 0x1320EC60: afr_discover_done (afr-common.c:2316) ==30315==by 0x1320EC60: afr_discover_cbk (afr-common.c:2361) ==30315==by 0x12F9EE91: client3_3_lookup_cbk (client-rpc-fops.c:2981) ==30315== Address 0x170b238c is on thread 8's stack ==30315== in frame #3, created by fuse_attr_cbk (fuse-bridge.c:723) ... ==30315== Warning: invalid file descriptor -1 in syscall close() ==30315== Thread 1: ==30315== Invalid free() / delete / delete[] / realloc() ==30315==at 0x4C2AD17: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==30315==by 0x67D663B: __libc_freeres (in /usr/lib64/libc-2.17.so) ==30315==by 0x4A246B4: _vgnU_freeres (in /usr/lib64/valgrind/vgpreload_core-amd64-linux.so) ==30315==by 0x66AAE2A: __run_exit_handlers (in /usr/lib64/libc-2.17.so) ==30315==by 0x66AAEB4: exit (in /usr/lib64/libc-2.17.so) ==30315==by 0x1117E9: cleanup_and_exit (glusterfsd.c:1308) ==30315==by 0x66A766F: ??? (in /usr/lib64/libc-2.17.so) ==30315==by 0x6076EF4: pthread_join (in /usr/lib64/libpthread-2.17.so) ==30315==by 0x4ECA687: event_dispatch_epoll (event-epoll.c:762) ==30315==by 0x10E876: main (glusterfsd.c:2370) ==30315== Address 0x6a2d3d0 is 0 bytes inside data symbol "noai6ai_cached" === It seems Massif crashes (?) because of invalid memory access in glusterfs process cleanup stage. Pranith? Nithya? 
29.08.2016 13:14, Oleksandr Natalenko wrote: === valgrind --tool=massif --trace-children=yes /usr/sbin/glusterfs -N --volfile-server=server.example.com --volfile-id=test /mnt/net/glusterfs/test === ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Profiling GlusterFS FUSE client with Valgrind's Massif tool
Hello. While I was dancing around huge memory consumption by the FUSE client [1], Pranith suggested that I use the Massif tool to find out the reason for the leak. Unfortunately, it does not work properly for me, and I believe I'm doing something wrong. Instead of generating a report after unmounting the volume or SIGTERMing the glusterfs process, Valgrind generates 2 reports (for 2 PIDs) right after launch and does not update them further, even on exit. I believe that is because something is going on with forking, but I cannot figure out what's going wrong. The command I use to launch GlusterFS via Valgrind+Massif: === valgrind --tool=massif --trace-children=yes /usr/sbin/glusterfs -N --volfile-server=server.example.com --volfile-id=test /mnt/net/glusterfs/test === Any ideas or sample use cases for Massif+GlusterFS? Thanks. Regards, Oleksandr [1] https://bugzilla.redhat.com/show_bug.cgi?id=1369364 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Update on GlusterFS-3.7.15
[1] and [2], please. Those are two parts of one fix backported from master. They have already been backported to 3.8, so only the backport to 3.7 is left. Regards, Oleksandr [1] http://review.gluster.org/#/c/14835/ [2] http://review.gluster.org/#/c/15167/ 22.08.2016 15:25, Kaushal M wrote: Notify the maintainers and me of any changes you need merged. You can reply to this thread to notify. Try to ensure that your changes get merged before this weekend. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] CentOS Regressions Failures for the last week
22.08.2016 10:34, Nigel Babu wrote: ./tests/basic/gfapi/gfapi-trunc.t: failed 6 times Fixed: http://review.gluster.org/#/c/15223/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-Maintainers] Glusterfs-3.7.13 release plans
Does this issue have some fix pending, or there is just bugreport? 08.07.2016 15:12, Kaushal M написав: On Fri, Jul 8, 2016 at 2:22 PM, Raghavendra Gowdappa wrote: There seems to be a major inode leak in fuse-clients: https://bugzilla.redhat.com/show_bug.cgi?id=1353856 We have found an RCA through code reading (though have a high confidence on the RCA). Do we want to include this in 3.7.13? I'm not going to be delaying the release anymore. I'll be adding this issue into the release-notes as a known-issue. regards, Raghavendra. - Original Message - From: "Kaushal M" To: "Pranith Kumar Karampuri" Cc: maintain...@gluster.org, "Gluster Devel" Sent: Friday, July 8, 2016 11:51:11 AM Subject: Re: [Gluster-Maintainers] Glusterfs-3.7.13 release plans On Fri, Jul 8, 2016 at 9:59 AM, Pranith Kumar Karampuri wrote: > Could you take in http://review.gluster.org/#/c/14598/ as well? It is ready > for merge. > > On Thu, Jul 7, 2016 at 3:02 PM, Atin Mukherjee wrote: >> >> Can you take in http://review.gluster.org/#/c/14861 ? Can you get one of the maintainers to give it a +2? >> >> >> On Thursday 7 July 2016, Kaushal M wrote: >>> >>> On Thu, Jun 30, 2016 at 11:08 AM, Kaushal M wrote: >>> > Hi all, >>> > >>> > I'm (or was) planning to do a 3.7.13 release on schedule today. 3.7.12 >>> > has a huge issue with libgfapi, solved by [1]. >>> > I'm not sure if this fixes the other issues with libgfapi noticed by >>> > Lindsay on gluster-users. >>> > >>> > This patch has been included in the packages 3.7.12 built for CentOS, >>> > Fedora, Ubuntu, Debian and SUSE. I guess Lindsay is using one of these >>> > packages, so it might be that the issue seen is new. So I'd like to do >>> > a quick release once we have a fix. >>> > >>> > Maintainers can merge changes into release-3.7 that follow the >>> > criteria given in [2]. Please make sure to add the bugs for patches >>> > you are merging are added as dependencies for the 3.7.13 tracker bug >>> > [3]. >>> > >>> >>> I've just merged the fix for the gfapi breakage into release-3.7, and >>> hope to tag 3.7.13 soon. >>> >>> The current head for release-3.7 is commit bddf6f8. 18 patches have >>> been merged since 3.7.12 for the following components, >>> - gfapi >>> - nfs (includes ganesha related changes) >>> - glusterd/cli >>> - libglusterfs >>> - fuse >>> - build >>> - geo-rep >>> - afr >>> >>> I need and acknowledgement from the maintainers of the above >>> components that they are ready. >>> If any maintainers know of any other issues, please reply here. We'll >>> decide how to address them for this release here. >>> >>> Also, please don't merge anymore changes into release-3.7. If you need >>> to get something merged, please inform me. 
>>> >>> Thanks, >>> Kaushal >>> >>> > Thanks, >>> > Kaushal >>> > >>> > [1]: https://review.gluster.org/14822 >>> > [2]: https://public.pad.fsfe.org/p/glusterfs-release-process-201606 >>> > under the GlusterFS minor release heading >>> > [3]: https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.7.13 >>> ___ >>> maintainers mailing list >>> maintain...@gluster.org >>> http://www.gluster.org/mailman/listinfo/maintainers >> >> >> >> -- >> Atin >> Sent from iPhone >> >> ___ >> maintainers mailing list >> maintain...@gluster.org >> http://www.gluster.org/mailman/listinfo/maintainers >> > > > > -- > Pranith ___ maintainers mailing list maintain...@gluster.org http://www.gluster.org/mailman/listinfo/maintainers ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node
OK, here the results go. I've taken 5 statedumps with 30 mins between each statedump. Also, before taking the statedump, I've recorded memory usage. Memory consumption: 1. root 1010 0.0 9.6 7538188 374864 ? Ssl чер07 0:16 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 2. root 1010 0.0 9.6 7825048 375312 ? Ssl чер07 0:16 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 3. root 1010 0.0 9.6 7825048 375312 ? Ssl чер07 0:17 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 4. root 1010 0.0 9.6 8202064 375892 ? Ssl чер07 0:17 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 5. root 1010 0.0 9.6 8316808 376084 ? Ssl чер07 0:17 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 As you may see VIRT constantly grows (except for one measurements), and RSS grows as well, although its increase is considerably smaller. Now lets take a look at statedumps: 1. https://gist.github.com/3fa121c7531d05b210b84d9db763f359 2. https://gist.github.com/87f48b8ac8378262b84d448765730fd9 3. https://gist.github.com/f8780014d8430d67687c70cfd1df9c5c 4. https://gist.github.com/916ac788f806328bad9de5311ce319d7 5. https://gist.github.com/8ba5dbf27d2cc61c04ca954d7fb0a7fd I'd go with comparing first statedump with last one, and here is diff output: https://gist.github.com/e94e7f17fe8b3688c6a92f49cbc15193 I see numbers changing, but now cannot conclude what is meaningful and what is meaningless. Pranith? 08.06.2016 10:06, Pranith Kumar Karampuri написав: On Wed, Jun 8, 2016 at 12:33 PM, Oleksandr Natalenko wrote: Yup, I can do that, but please note that RSS does not change. Will statedump show VIRT values? Also, I'm looking at the numbers now, and see that on each reconnect VIRT grows by ~24M (once per ~10–15 mins). Probably, that could give you some idea what is going wrong. That's interesting. Never saw something like this happen. I would still like to see if there are any clues in statedump when all this happens. May be what you said will be confirmed that nothing new is allocated but I would just like to confirm. 08.06.2016 09:50, Pranith Kumar Karampuri написав: Oleksandr, Could you take statedump of the shd process once in 5-10 minutes and send may be 5 samples of them when it starts to increase? This will help us find what datatypes are being allocated a lot and can lead to coming up with possible theories for the increase. 
On Wed, Jun 8, 2016 at 12:03 PM, Oleksandr Natalenko wrote: Also, I've checked shd log files, and found out that for some reason shd constantly reconnects to bricks: [1] Please note that suggested fix [2] by Pranith does not help, VIRT value still grows: === root 1010 0.0 9.6 7415248 374688 ? Ssl чер07 0:14 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === I do not know the reason why it is reconnecting, but I suspect leak to happen on that reconnect. CCing Pranith. [1] http://termbin.com/brob [2] http://review.gluster.org/#/c/14053/ 06.06.2016 12:21, Kaushal M написав: Has multi-threaded SHD been merged into 3.7.* by any chance? If not, what I'm saying below doesn't apply. We saw problems when encrypted transports were used, because the RPC layer was not reaping threads (doing pthread_join) when a connection ended. This lead to similar observations of huge VIRT and relatively small RSS. I'm not sure how multi-threaded shd works, but it could be leaking threads in a similar way. On Mon, Jun 6, 2016 at 1:54 P
Re: [Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node
Yup, I can do that, but please note that RSS does not change. Will statedump show VIRT values? Also, I'm looking at the numbers now, and see that on each reconnect VIRT grows by ~24M (once per ~10–15 mins). Probably, that could give you some idea what is going wrong. 08.06.2016 09:50, Pranith Kumar Karampuri написав: Oleksandr, Could you take statedump of the shd process once in 5-10 minutes and send may be 5 samples of them when it starts to increase? This will help us find what datatypes are being allocated a lot and can lead to coming up with possible theories for the increase. On Wed, Jun 8, 2016 at 12:03 PM, Oleksandr Natalenko wrote: Also, I've checked shd log files, and found out that for some reason shd constantly reconnects to bricks: [1] Please note that suggested fix [2] by Pranith does not help, VIRT value still grows: === root 1010 0.0 9.6 7415248 374688 ? Ssl чер07 0:14 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === I do not know the reason why it is reconnecting, but I suspect leak to happen on that reconnect. CCing Pranith. [1] http://termbin.com/brob [2] http://review.gluster.org/#/c/14053/ 06.06.2016 12:21, Kaushal M написав: Has multi-threaded SHD been merged into 3.7.* by any chance? If not, what I'm saying below doesn't apply. We saw problems when encrypted transports were used, because the RPC layer was not reaping threads (doing pthread_join) when a connection ended. This lead to similar observations of huge VIRT and relatively small RSS. I'm not sure how multi-threaded shd works, but it could be leaking threads in a similar way. On Mon, Jun 6, 2016 at 1:54 PM, Oleksandr Natalenko wrote: Hello. We use v3.7.11, replica 2 setup between 2 nodes + 1 dummy node for keeping volumes metadata. Now we observe huge VSZ (VIRT) usage by glustershd on dummy node: === root 15109 0.0 13.7 76552820 535272 ? Ssl тра26 2:11 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === that is ~73G. RSS seems to be OK (~522M). Here is the statedump of glustershd process: [1] Also, here is sum of sizes, presented in statedump: === # cat /var/run/gluster/glusterdump.15109.dump.1465200139 | awk -F '=' 'BEGIN {sum=0} /^size=/ {sum+=$2} END {print sum}' 353276406 === That is ~337 MiB. Also, here are VIRT values from 2 replica nodes: === root 24659 0.0 0.3 5645836 451796 ? Ssl тра24 3:28 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/44ec3f29003eccedf894865107d5db90.socket --xlator-option *replicate*.node-uuid=a19afcc2-e26c-43ce-bca6-d27dc1713e87 root 18312 0.0 0.3 6137500 477472 ? Ssl тра19 6:37 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/1670a3abbd1eea968126eb6f5be20322.socket --xlator-option *replicate*.node-uuid=52dca21b-c81c-48b5-9de2-1ed37987fbc2 === Those are 5 to 6G, which is much less than dummy node has, but still look too big for us. 
Should we care about huge VIRT value on dummy node? Also, how one would debug that? Regards, Oleksandr. [1] https://gist.github.com/d2cfa25251136512580220fcdb8a6ce6 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel -- Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node
Also, I've checked shd log files, and found out that for some reason shd constantly reconnects to bricks: [1] Please note that suggested fix [2] by Pranith does not help, VIRT value still grows: === root 1010 0.0 9.6 7415248 374688 ? Ssl чер07 0:14 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === I do not know the reason why it is reconnecting, but I suspect leak to happen on that reconnect. CCing Pranith. [1] http://termbin.com/brob [2] http://review.gluster.org/#/c/14053/ 06.06.2016 12:21, Kaushal M написав: Has multi-threaded SHD been merged into 3.7.* by any chance? If not, what I'm saying below doesn't apply. We saw problems when encrypted transports were used, because the RPC layer was not reaping threads (doing pthread_join) when a connection ended. This lead to similar observations of huge VIRT and relatively small RSS. I'm not sure how multi-threaded shd works, but it could be leaking threads in a similar way. On Mon, Jun 6, 2016 at 1:54 PM, Oleksandr Natalenko wrote: Hello. We use v3.7.11, replica 2 setup between 2 nodes + 1 dummy node for keeping volumes metadata. Now we observe huge VSZ (VIRT) usage by glustershd on dummy node: === root 15109 0.0 13.7 76552820 535272 ? Ssl тра26 2:11 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === that is ~73G. RSS seems to be OK (~522M). Here is the statedump of glustershd process: [1] Also, here is sum of sizes, presented in statedump: === # cat /var/run/gluster/glusterdump.15109.dump.1465200139 | awk -F '=' 'BEGIN {sum=0} /^size=/ {sum+=$2} END {print sum}' 353276406 === That is ~337 MiB. Also, here are VIRT values from 2 replica nodes: === root 24659 0.0 0.3 5645836 451796 ? Ssl тра24 3:28 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/44ec3f29003eccedf894865107d5db90.socket --xlator-option *replicate*.node-uuid=a19afcc2-e26c-43ce-bca6-d27dc1713e87 root 18312 0.0 0.3 6137500 477472 ? Ssl тра19 6:37 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/1670a3abbd1eea968126eb6f5be20322.socket --xlator-option *replicate*.node-uuid=52dca21b-c81c-48b5-9de2-1ed37987fbc2 === Those are 5 to 6G, which is much less than dummy node has, but still look too big for us. Should we care about huge VIRT value on dummy node? Also, how one would debug that? Regards, Oleksandr. [1] https://gist.github.com/d2cfa25251136512580220fcdb8a6ce6 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node
Also, I see lots of entries in pmap output: === 7ef9ff8f3000 4K - [ anon ] 7ef9ff8f4000 8192K rw--- [ anon ] 7efa000f4000 4K - [ anon ] 7efa000f5000 8192K rw--- [ anon ] === If I sum them, I get the following: === # pmap 15109 | grep '[ anon ]' | grep 8192K | wc -l 9261 $ echo "9261*(8192+4)" | bc 75903156 === Which is something like 70G+ I have got in VIRT. 06.06.2016 11:24, Oleksandr Natalenko написав: Hello. We use v3.7.11, replica 2 setup between 2 nodes + 1 dummy node for keeping volumes metadata. Now we observe huge VSZ (VIRT) usage by glustershd on dummy node: === root 15109 0.0 13.7 76552820 535272 ? Ssl тра26 2:11 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === that is ~73G. RSS seems to be OK (~522M). Here is the statedump of glustershd process: [1] Also, here is sum of sizes, presented in statedump: === # cat /var/run/gluster/glusterdump.15109.dump.1465200139 | awk -F '=' 'BEGIN {sum=0} /^size=/ {sum+=$2} END {print sum}' 353276406 === That is ~337 MiB. Also, here are VIRT values from 2 replica nodes: === root 24659 0.0 0.3 5645836 451796 ? Ssl тра24 3:28 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/44ec3f29003eccedf894865107d5db90.socket --xlator-option *replicate*.node-uuid=a19afcc2-e26c-43ce-bca6-d27dc1713e87 root 18312 0.0 0.3 6137500 477472 ? Ssl тра19 6:37 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/1670a3abbd1eea968126eb6f5be20322.socket --xlator-option *replicate*.node-uuid=52dca21b-c81c-48b5-9de2-1ed37987fbc2 === Those are 5 to 6G, which is much less than dummy node has, but still look too big for us. Should we care about huge VIRT value on dummy node? Also, how one would debug that? Regards, Oleksandr. [1] https://gist.github.com/d2cfa25251136512580220fcdb8a6ce6 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
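For what it is worth, the ~8192K anonymous mappings counted above are the size of a default glibc thread stack (ulimit -s is commonly 8 MiB), which fits the thread-reaping theory Kaushal quotes in this thread: threads created joinable but never pthread_join()ed keep their stack mappings, so VIRT balloons while RSS stays small. A minimal, illustrative sketch (not GlusterFS code) of that effect:
===
/* Illustrative only -- not GlusterFS code. Joinable threads that exit but
 * are never pthread_join()ed keep their stack mappings (commonly 8 MiB,
 * i.e. the 8192K anonymous regions seen in pmap), so VIRT grows while RSS
 * stays almost flat. Build with `gcc thread-leak-demo.c -lpthread`. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
        (void)arg;
        return NULL; /* exits immediately, but remains joinable */
}

int main(void)
{
        pthread_t t;
        int i;

        for (i = 0; i < 100; i++) {
                pthread_create(&t, NULL, worker, NULL);
                /* Missing pthread_join(t, NULL) or pthread_detach(t):
                 * each iteration leaves one thread stack mapped. */
        }

        printf("pid %d: try `pmap %d | grep 8192K`\n",
               (int)getpid(), (int)getpid());
        pause();
        return 0;
}
===
Running this and inspecting the process with pmap should show one 8192K anonymous region per un-reaped thread, very much like the output above.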
Re: [Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node
I believe, multi-threaded shd has not been merged at least into 3.7 branch prior to 3.7.11 (incl.), because I've found this [1]. [1] https://www.gluster.org/pipermail/maintainers/2016-April/000628.html 06.06.2016 12:21, Kaushal M написав: Has multi-threaded SHD been merged into 3.7.* by any chance? If not, what I'm saying below doesn't apply. We saw problems when encrypted transports were used, because the RPC layer was not reaping threads (doing pthread_join) when a connection ended. This lead to similar observations of huge VIRT and relatively small RSS. I'm not sure how multi-threaded shd works, but it could be leaking threads in a similar way. On Mon, Jun 6, 2016 at 1:54 PM, Oleksandr Natalenko wrote: Hello. We use v3.7.11, replica 2 setup between 2 nodes + 1 dummy node for keeping volumes metadata. Now we observe huge VSZ (VIRT) usage by glustershd on dummy node: === root 15109 0.0 13.7 76552820 535272 ? Ssl тра26 2:11 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === that is ~73G. RSS seems to be OK (~522M). Here is the statedump of glustershd process: [1] Also, here is sum of sizes, presented in statedump: === # cat /var/run/gluster/glusterdump.15109.dump.1465200139 | awk -F '=' 'BEGIN {sum=0} /^size=/ {sum+=$2} END {print sum}' 353276406 === That is ~337 MiB. Also, here are VIRT values from 2 replica nodes: === root 24659 0.0 0.3 5645836 451796 ? Ssl тра24 3:28 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/44ec3f29003eccedf894865107d5db90.socket --xlator-option *replicate*.node-uuid=a19afcc2-e26c-43ce-bca6-d27dc1713e87 root 18312 0.0 0.3 6137500 477472 ? Ssl тра19 6:37 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/1670a3abbd1eea968126eb6f5be20322.socket --xlator-option *replicate*.node-uuid=52dca21b-c81c-48b5-9de2-1ed37987fbc2 === Those are 5 to 6G, which is much less than dummy node has, but still look too big for us. Should we care about huge VIRT value on dummy node? Also, how one would debug that? Regards, Oleksandr. [1] https://gist.github.com/d2cfa25251136512580220fcdb8a6ce6 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Huge VSZ (VIRT) usage by glustershd on dummy node
Hello. We use v3.7.11 in a replica 2 setup between 2 nodes + 1 dummy node for keeping volume metadata. Now we observe huge VSZ (VIRT) usage by glustershd on the dummy node: === root 15109 0.0 13.7 76552820 535272 ? Ssl тра26 2:11 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7848e17764dd4ba80f4623aecb91b07a.socket --xlator-option *replicate*.node-uuid=80bc95e1-2027-4a96-bb66-d9c8ade624d7 === that is ~73G. RSS seems to be OK (~522M). Here is the statedump of the glustershd process: [1] Also, here is the sum of the sizes presented in the statedump: === # cat /var/run/gluster/glusterdump.15109.dump.1465200139 | awk -F '=' 'BEGIN {sum=0} /^size=/ {sum+=$2} END {print sum}' 353276406 === That is ~337 MiB. Also, here are the VIRT values from the 2 replica nodes: === root 24659 0.0 0.3 5645836 451796 ? Ssl тра24 3:28 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/44ec3f29003eccedf894865107d5db90.socket --xlator-option *replicate*.node-uuid=a19afcc2-e26c-43ce-bca6-d27dc1713e87 root 18312 0.0 0.3 6137500 477472 ? Ssl тра19 6:37 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/1670a3abbd1eea968126eb6f5be20322.socket --xlator-option *replicate*.node-uuid=52dca21b-c81c-48b5-9de2-1ed37987fbc2 === Those are 5 to 6G, which is much less than the dummy node has, but still looks too big to us. Should we care about the huge VIRT value on the dummy node? Also, how would one debug that? Regards, Oleksandr. [1] https://gist.github.com/d2cfa25251136512580220fcdb8a6ce6 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Idea: Alternate Release process
30.05.2016 05:08, Sankarshan Mukhopadhyay wrote: It would perhaps be worthwhile to extend this release timeline/cadence discussion into (a) End-of-Life definition and invocation (b) whether a 'long term support' (assuming that is what LTS is) is of essentially any value to users of GlusterFS. (b) especially can be (and perhaps should be) addressed by predictable and tested upgrade paths to ensure that users are able to get to newer releases without much hassle. I believe 3.7 should be an LTS with EOL in at least 1 year, because it is the last branch released before the changes to the release process were committed. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Idea: Alternate Release process
My 2 cents on timings etc. Rationale: 1. deliver new features to users as fast as possible to get feedback; 2. leave an option of using an LTS branch for those who do not want to update too often. Definition: * "stable release" — .0 tag that receives critical bugfixes and security updates for 16 weeks; * "LTS" — .0 tag that receives critical bugfixes and security updates for 1 year. A new release happens every 8 weeks. Those 8 weeks include: * a merge window for 3 weeks, during which all features that are ready get merged into master; * feature freeze on -rc1 tagging; * 5 weeks of testing, bugfixing and preparing new features; * tagging the .0 stable release. Example (imaginary versions and dates): March 1 — 5.0 release, merge window opens March 22 — 6.0-rc1 release, merge window closes, feature freeze, new -rc each week May 1 — 6.0 release, merge window opens, 5.0 still gets fixes May 22 — 7.0-rc1 release July 1 — 7.0 release, merge window closes, no more fixes for 5.0, 6.0 still gets fixes ... September 1 — 8.0 release, LTS, EOL is Sep 1 next year. ... Backward compatibility should be guaranteed between two consecutive LTSes by extensive use of op-version. The user should have the possibility to upgrade from one LTS to another, preferably with no downtime. LTS+1 is not guaranteed to be backward compatible with LTS-1. Pros: * frequent releases with new features that do not break backward compatibility; * max 2 stable branches supported simultaneously; * a guaranteed LTS branch with a guaranteed upgrade to the new LTS. Cons: * no idea what to do with things that break backward compatibility and that couldn't be implemented within op-version constraints (except postponing them for too long). ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [RFC] FUSE bridge based on GlusterFS API
11.04.2016 11:14, Niels de Vos wrote: I would like to add a detection for an xglfs executable in the /sbin/mount.glusterfs script. This then makes it possible to replace the original FUSE client with xglfs. If we do something similar in our regression tests, we can get an idea of how stable and feature-complete xglfs is. I believe adding it to the tests prior to adding it to the packaged /sbin/mount.glusterfs script should be the first step. I assume this is like the FIBMAP-ioctl(). So, obviously, we do not need it. We actually do have a flush FOP in the GlusterFS protocol and xlators. But it has not been added to libgfapi. The library calls flush from glfs_close(). I'm not sure we really need to add glfs_flush() to libgfapi, most (all?) applications would likely use glfs_fsync() anyway? I've added a dummy handler for this fop that always returns 0. It shouldn't be a big deal to replace it with an actual implementation if libgfapi gains glfs_flush() support. There is both glfs_fsync() and glfs_fdatasync(). These match their fsync() and fdatasync() counterparts. OK, so they are implemented correctly in xglfs now. * .fsyncdir fop (again, wtf?); This is like calling fsync() on a directory. It guarantees that changes in the directory (new/unlinked files) are persistent. I cannot find a similar function in the GlusterFS API. Should it be implemented first, or are we fine to proceed with a dummy handler returning 0? * WHERE IS MY glfs_truncate()? Almost there, Jeff sent a patch. We just need a bug to link the patch against. Saw that, many thanks to Jeff for accepting the challenge :). Would you be willing to move the GitHub repository under github.com/gluster ? It gives a little more visibility in our community that way. See no issue with that. Ping me on IRC to arrange the move. Regards, post-factum ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
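For reference, here is a rough sketch of how the .flush no-op and the fsync/fdatasync split discussed above could look in a high-level FUSE bridge backed by gfapi. This is illustrative, not a copy of the real xglfs handlers: it assumes FUSE_USE_VERSION 26, gfapi signatures as in the 3.x glfs.h, and a hypothetical helper xglfs_fd() that recovers the glfs_fd_t stashed in fi->fh by the open handler:
===
/* Rough sketch only -- not the actual xglfs code. Assumes the high-level
 * FUSE API (FUSE_USE_VERSION 26), gfapi signatures as in the 3.x glfs.h,
 * and a hypothetical xglfs_fd() helper (see the lead-in above). */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <glusterfs/api/glfs.h>
#include <errno.h>
#include <stdint.h>

static glfs_fd_t *xglfs_fd(struct fuse_file_info *fi)
{
        return (glfs_fd_t *)(uintptr_t)fi->fh;
}

/* gfapi has no glfs_flush() (glfs_close() flushes internally), so the
 * handler can be a no-op that reports success. */
static int xglfs_flush(const char *path, struct fuse_file_info *fi)
{
        (void)path;
        (void)fi;
        return 0;
}

/* FUSE folds fsync() and fdatasync() into one callback with a "datasync"
 * flag, while gfapi exposes two separate calls. */
static int xglfs_fsync(const char *path, int datasync,
                       struct fuse_file_info *fi)
{
        int ret;

        (void)path;
        ret = datasync ? glfs_fdatasync(xglfs_fd(fi))
                       : glfs_fsync(xglfs_fd(fi));

        return ret ? -errno : 0;
}
===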
Re: [Gluster-devel] [RFC] FUSE bridge based on GlusterFS API
On Thursday, 7 April 2016 at 16:12:07 EEST Jeff Darcy wrote: > "Considered wrong" might be overstating the case. It might be useful to > keep in mind that the fuse-bridge code predates GFAPI by a considerable > amount. In fact, significant parts of GFAPI were borrowed from the > existing fuse-bridge code. At the time, there was a choice between > FUSE's path-based and inode-based APIs. Using the path-based API has > always been easier, requiring less code. We can see this effect in the > fact that xglfs is a little less than 1/3 as much code as fuse-bridge. Sure, the API came years after the FUSE bridge, and I understand that. Also, that explains to me the code doubling, where the FUSE bridge is a (completely) separate entity from the API, doing pretty much the same (as I thought) thing. > On the other hand, and again at the time, there were some pretty good > reasons to eschew the path-based API. I don't remember all of those > reasons (some of them predate even my involvement with the project) but > I'm pretty sure performance was chief among them. The path-based API is > completely synchronous, which would have been utterly disastrous for > performance prior to syncops (which of course came later). Even with > syncops, it's not clear whether that gap has been or can be closed. If > we ever wanted to consider switching fully to the path-based API, we'd > certainly need to examine that issue closely. Other issues that > differentiate the two APIs might include: > > * Access to unlinked files (which have no path). > > * Levels of control over various forms of caching. > > * Ability to use reverse invalidation. > > * Ability to support SELinux (which does nasty stuff during mount). > > * Other ops that might be present only in the inode API. > > * Security. > > Perhaps we should ping Miklos Szeredi (kernel FUSE maintainer, now at > Red Hat) about some of these. Also, Soumya pointed me to the handles API (/usr/include/glusterfs/api/glfs-handles.h). If I got it correctly, it could probably be used instead of the path-based API for the FUSE bridge? I have briefly looked at it, but the article about NFS handles (again, supplied to me by Soumya) remains unread so far :). Does the handles API represent the inode API you are talking about? Then, also, we shouldn't use the high-level FUSE API and should stick to the low-level one instead, as it (AFAIK, correct me if I'm wrong) operates on inodes as well. > > * FUSE .bmap fop (wtf is that?); > > * trickery with .fgetattr (do we need that trickery?); > > Not sure what you mean here. Do you mean fgetxattr? I mean .fgetattr calling .getattr for / and glfs_fstat() for everything else. Not sure why it happens. The BBFS code says: === // On FreeBSD, trying to do anything with the mountpoint ends up // opening it, and then using the FD for an fgetattr. So in the // special case of a path of "/", I need to do a getattr on the // underlying root directory instead of doing the fgetattr(). === I just wanted to note that (in case my toy could become portable across *nixes). > > > * .flush fop (no GlusterFS equivalent?); > > * fsync/fdatasync difference for GlusterFS; > > * .fsyncdir fop (again, wtf?); > > I suspect these are related to the path-based vs. inode-based issue. > Fact is, the VFS calls and syscalls have never lined up entirely, and it > shows up in differences like these. I definitely need help with the handles API if that is the right thing to look into. > That just seems like a bug. There should be one. That is definitely the bug.
/usr/include/glusterfs/api/glfs.h clearly defines it: === int glfs_truncate (glfs_t *fs, const char *path, off_t length) __THROW GFAPI_PUBLIC(glfs_truncate, 3.4.0); === But linking an executable with a call to glfs_truncate() results in an error: === CMakeFiles/xglfs.dir/xglfs_truncate.c.o: In function `xglfs_truncate': /home/pf/work/devel/own/xglfs/xglfs_truncate.c:31: undefined reference to `glfs_truncate' === The bug was discussed more than a year ago [1], but it seems there is no solution so far. Thanks. Regards, post-factum [1] http://irclog.perlgeek.de/gluster-dev/2015-01-25/text ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
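Until the glfs_truncate() symbol is actually exported, one possible workaround (a sketch under the assumption of 3.x gfapi signatures, not necessarily what xglfs should do) is to emulate it with glfs_open() + glfs_ftruncate() + glfs_close(), which are all exported:
===
/* Sketch of a possible workaround while the glfs_truncate() symbol is not
 * exported: emulate it with glfs_open() + glfs_ftruncate() + glfs_close().
 * Signatures are assumed as in the 3.x api/glfs.h; error handling is
 * deliberately minimal. */
#include <glusterfs/api/glfs.h>
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>

static int xglfs_truncate_fallback(glfs_t *fs, const char *path, off_t length)
{
        glfs_fd_t *fd;
        int ret;

        fd = glfs_open(fs, path, O_WRONLY);
        if (!fd)
                return -errno;

        ret = glfs_ftruncate(fd, length);
        if (ret)
                ret = -errno;

        glfs_close(fd);
        return ret;
}
===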
[Gluster-devel] [RFC] FUSE bridge based on GlusterFS API
Hello. Pranith and Niels encouraged me to share here my toy project of a simple FUSE bridge that uses the GlusterFS API [1]. The rationale for this is that the FUSE bridge currently present in the GlusterFS code does not use the GlusterFS API, that is considered to be wrong, and there are some plans to replace it with a modern solution. The xglfs code could potentially go under the glusterfs tree if the developers decide that should happen. Also, it could be rpm'ed and offered to Fedora users. For now xglfs is just a separate executable that relies on the glusterfs-devel and fuse-devel packages and does a simple conversion between FUSE VFS calls and the GlusterFS API. Thanks to the API's completeness (well, glfs_truncate() is an exception, AFAIK), this custom bridge is really thin and small. As a guide I used the Big Brother File System code by Joseph J. Pfeiffer, Jr. [2], which is freely available on the Internet (version 2014-06-12, but a newer version has been released recently). However, I've adapted it to the current FUSE libs reality just by inspecting /usr/include/fuse/fuse.h carefully and defining FUSE_USE_VERSION=26 explicitly. What I would like reviewers to pay attention to: * error path handling correctness (mostly, a negated errno value is returned — is that correct?); * fops semantic correctness; * everything else you would like to comment on or suggest. The code itself has been verified by GCC, Clang (+analyzer), Intel C Compiler, cppcheck and Valgrind. No idea what could go wrong there :). However, I'm not responsible for data damage caused by this project, of course. Some things remain not so clear to me: * FUSE .bmap fop (wtf is that?); * trickery with .fgetattr (do we need that trickery?); * .flush fop (no GlusterFS equivalent?); * fsync/fdatasync difference for GlusterFS; * .fsyncdir fop (again, wtf?); * WHERE IS MY glfs_truncate()? Feel free to happily accept this project or ignore it silently. Nevertheless, I would be happy to see your pull requests or comments, or even the results of some test you might want to perform on your critical production. Also, I know that Soumya has already tried xglfs, and I would be glad if she shared some experience with it. Best wishes, post-factum [1] https://github.com/pfactum/xglfs [2] http://www.cs.nmsu.edu/~pfeiffer/fuse-tutorial/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
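Regarding the error-path question above: with the high-level FUSE API, returning the negated errno value from a handler is indeed the expected convention. Below is a minimal, hypothetical sketch of that convention for a gfapi-backed getattr handler — it assumes the glfs_t handle was handed to fuse_main() as user data and is therefore reachable via fuse_get_context()->private_data, which is a common BBFS-style pattern and not necessarily how xglfs wires it up:
===
/* Minimal sketch of the "return -errno on failure" convention for a
 * high-level FUSE handler backed by gfapi. Hypothetical code: the glfs_t
 * handle is assumed to live in fuse_get_context()->private_data. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <glusterfs/api/glfs.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static int xglfs_getattr(const char *path, struct stat *stbuf)
{
        glfs_t *fs = fuse_get_context()->private_data;

        memset(stbuf, 0, sizeof(*stbuf));

        /* gfapi calls return -1 and set errno on failure; the high-level
         * FUSE API expects the handler to return the negated errno. */
        if (glfs_lstat(fs, path, stbuf) == -1)
                return -errno;

        return 0;
}
===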
Re: [Gluster-devel] [Gluster-users] Arbiter brick size estimation
And for 256b inodes: (597904 - 33000) / (1066036 - 23) == 530 bytes per inode. So I still consider 1k to be a good estimate for an average workload. Regards, Oleksandr. On Thursday, 17 March 2016 at 09:58:14 EET Ravishankar N wrote: > Looks okay to me Oleksandr. You might want to make a github gist of your > tests+results as a reference for others. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Arbiter brick size estimation
Ravi, I will definitely arrange the results into some short handy document and post it here. Also, @JoeJulian on IRC suggested that I perform this test on XFS bricks with an inode size of 256b and 1k: === 22:38 <@JoeJulian> post-factum: Just wondering what 256 byte inodes might look like for that. And, by the same token, 1k inodes. 22:39 < post-factum> JoeJulian: should I try 1k inodes instead? 22:41 <@JoeJulian> post-factum: Doesn't hurt to try. My expectation is that disk usage will go up despite inode usage going down. 22:41 < post-factum> JoeJulian: ok, will check that 22:41 <@JoeJulian> post-factum: and with 256, I'm curious if inode usage will stay close to the same while disk usage goes down. === Here are the results for 1k: (1171336 - 33000) / (1066036 - 23) == 1068 bytes per inode. Disk usage is indeed higher (1.2G), but inode usage is the same. Will test with 256b inodes now. 17.03.2016 06:28, Ravishankar N wrote: Looks okay to me Oleksandr. You might want to make a github gist of your tests+results as a reference for others. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Arbiter brick size estimation
OK, I've repeated the test with the following hierarchy: * 10 top-level folders with 10 second-level folders each; * 10 000 files in each second-level folder. So, this composes 10×10×10 000 = 1M files and 100 folders. Initial brick used space: 33M. Initial inodes count: 24. After the test: * each brick in the replica took 18G, and the arbiter brick took 836M; * inodes count: 1066036. So: (836 - 33) / (1066036 - 24) == 790 bytes per inode. So, yes, it is a slightly bigger value than with the previous test due to, I guess, lots of files in one folder, but it is still far from 4k. Given that a good engineer should consider a 30% reserve, the ratio is about 1k per stored inode. Correct me if I'm missing something (regarding an average workload and not corner cases). The test script is here: [1] Regards, Oleksandr. [1] http://termbin.com/qlvz On Tuesday, 8 March 2016 at 19:13:05 EET Ravishankar N wrote: > On 03/05/2016 03:45 PM, Oleksandr Natalenko wrote: > > In order to estimate GlusterFS arbiter brick size, I've deployed test > > setup > > with replica 3 arbiter 1 volume within one node. Each brick is located > > on > > separate HDD (XFS with inode size == 512). Using GlusterFS v3.7.6 + > > memleak > > patches. Volume options are kept default. > > > > Here is the script that creates files and folders in mounted volume: [1] > > > > The script creates 1M of files of random size (between 1 and 32768 bytes) > > and some amount of folders. After running it I've got 1036637 folders. > > So, in total it is 2036637 files and folders. > > > > The initial used space on each brick is 42M. After running script I've > > got: > > > > replica brick 1 and 2: 19867168 kbytes == 19G > > arbiter brick: 1872308 kbytes == 1.8G > > > > The amount of inodes on each brick is 3139091. So here goes estimation. > > > > Dividing arbiter used space by files+folders we get: > > > > (1872308 - 42000)/2036637 == 899 bytes per file or folder > > > > Dividing arbiter used space by inodes we get: > > > > (1872308 - 42000)/3139091 == 583 bytes per inode > > > > Not sure about what calculation is correct. > I think the first one is right because you still haven't used up all the > inodes (2036637 used vs. the max. permissible 3139091). But again this > is an approximation because not all files would be 899 bytes. For > example, if there are a thousand files present in a directory, then du of > the directory would be more than du of the files because the directory > will take some disk space to store the dentries. > > I guess we should consider the one > > that accounts inodes because of .glusterfs/ folder data. > > Nevertheless, in contrast, documentation [2] says it should be 4096 bytes > > per file. Am I wrong with my calculations? > The 4KB is a conservative estimate considering the fact that though the > arbiter brick does not store data, it still keeps a copy of both user > and gluster xattrs. For example, if the application sets a lot of > xattrs, it can consume a data block if they cannot be accommodated on > the inode itself. Also there is the .glusterfs folder like you said > which would take up some space.
Here is what I tried on an XFS brick: > [root@ravi4 brick]# touch file > > [root@ravi4 brick]# ls -l file > -rw-r--r-- 1 root root 0 Mar 8 12:54 file > > [root@ravi4 brick]# du file > *0 file** > * > [root@ravi4 brick]# for i in {1..100} > > > do > > setfattr -n user.value$i -v value$i file > > done > > [root@ravi4 brick]# ll -l file > -rw-r--r-- 1 root root 0 Mar 8 12:54 file > > [root@ravi4 brick]# du -h file > *4.0Kfile** > * > Hope this helps, > Ravi > > > Pranith? > > > > [1] http://termbin.com/ka9x > > [2] > > http://gluster.readthedocs.org/en/latest/Administrator%20Guide/arbiter-vo > > lumes-and-quorum/ ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Arbiter brick size estimation
Ravi, here is the summary: [1] Regards, Oleksandr. [1] https://gist.github.com/e8265ca07f7b19f30bb3 On четвер, 17 березня 2016 р. 09:58:14 EET Ravishankar N wrote: > On 03/16/2016 10:57 PM, Oleksandr Natalenko wrote: > > OK, I've repeated the test with the following hierarchy: > > > > * 10 top-level folders with 10 second-level folders each; > > * 10 000 files in each second-level folder. > > > > So, this composes 10×10×1=1M files and 100 folders > > > > Initial brick used space: 33 M > > Initial inodes count: 24 > > > > After test: > > > > * each brick in replica took 18G, and the arbiter brick took 836M; > > * inodes count: 1066036 > > > > So: > > > > (836 - 33) / (1066036 - 24) == 790 bytes per inode. > > > > So, yes, it is slightly bigger value than with previous test due to, I > > guess, lots of files in one folder, but it is still too far from 4k. > > Given a good engineer should consider 30% reserve, the ratio is about 1k > > per stored inode. > > > > Correct me if I'm missing something (regarding average workload and not > > corner cases). > > Looks okay to me Oleksandr. You might want to make a github gist of your > tests+results as a reference for others. > Regards, > Ravi > > > Test script is here: [1] > > > > Regards, > > > >Oleksandr. > > > > [1] http://termbin.com/qlvz > > > > On вівторок, 8 березня 2016 р. 19:13:05 EET Ravishankar N wrote: > >> On 03/05/2016 03:45 PM, Oleksandr Natalenko wrote: > >>> In order to estimate GlusterFS arbiter brick size, I've deployed test > >>> setup > >>> with replica 3 arbiter 1 volume within one node. Each brick is located > >>> on > >>> separate HDD (XFS with inode size == 512). Using GlusterFS v3.7.6 + > >>> memleak > >>> patches. Volume options are kept default. > >>> > >>> Here is the script that creates files and folders in mounted volume: [1] > >>> > >>> The script creates 1M of files of random size (between 1 and 32768 > >>> bytes) > >>> and some amount of folders. After running it I've got 1036637 folders. > >>> So, in total it is 2036637 files and folders. > >>> > >>> The initial used space on each brick is 42M . After running script I've > >>> got: > >>> > >>> replica brick 1 and 2: 19867168 kbytes == 19G > >>> arbiter brick: 1872308 kbytes == 1.8G > >>> > >>> The amount of inodes on each brick is 3139091. So here goes estimation. > >>> > >>> Dividing arbiter used space by files+folders we get: > >>> > >>> (1872308 - 42000)/2036637 == 899 bytes per file or folder > >>> > >>> Dividing arbiter used space by inodes we get: > >>> > >>> (1872308 - 42000)/3139091 == 583 bytes per inode > >>> > >>> Not sure about what calculation is correct. > >> > >> I think the first one is right because you still haven't used up all the > >> inodes.(2036637 used vs. the max. permissible 3139091). But again this > >> is an approximation because not all files would be 899 bytes. For > >> example if there are a thousand files present in a directory, then du > >> would be more than du because the directory will take > >> some disk space to store the dentries. > >> > >>>I guess we should consider the one > >>> > >>> that accounts inodes because of .glusterfs/ folder data. > >>> > >>> Nevertheless, in contrast, documentation [2] says it should be 4096 > >>> bytes > >>> per file. Am I wrong with my calculations? > >> > >> The 4KB is a conservative estimate considering the fact that though the > >> arbiter brick does not store data, it still keeps a copy of both user > >> and gluster xattrs. 
For example, if the application sets a lot of > >> xattrs, it can consume a data block if they cannot be accommodated on > >> the inode itself. Also there is the .glusterfs folder like you said > >> which would take up some space. Here is what I tried on an XFS brick: > >> [root@ravi4 brick]# touch file > >> > >> [root@ravi4 brick]# ls -l file > >> -rw-r--r-- 1 root root 0 Mar 8 12:54 file > >> > >> [root@ravi4 brick]# du file > >> *0 file** > >> * > >> [root@ravi4 brick]# for i in {1..100} > >> > >> > do > >> > setfattr -n user.value$i -v value$i file > >> > done > >> > >> [root@ravi4 brick]# ll -l file > >> -rw-r--r-- 1 root root 0 Mar 8 12:54 file > >> > >> [root@ravi4 brick]# du -h file > >> *4.0Kfile** > >> * > >> Hope this helps, > >> Ravi > >> > >>> Pranith? > >>> > >>> [1] http://termbin.com/ka9x > >>> [2] > >>> http://gluster.readthedocs.org/en/latest/Administrator%20Guide/arbiter-v > >>> o > >>> lumes-and-quorum/ ___ > >>> Gluster-devel mailing list > >>> Gluster-devel@gluster.org > >>> http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Arbiter brick size estimation
Hi. On вівторок, 8 березня 2016 р. 19:13:05 EET Ravishankar N wrote: > I think the first one is right because you still haven't used up all the > inodes (2036637 used vs. the max. permissible 3139091). But again this > is an approximation because not all files would be 899 bytes. For > example if there are a thousand files present in a directory, then du > would be more than du because the directory will take > some disk space to store the dentries. I believe you've got me wrong. 2036637 is the number of files+folders. 3139091 is the number of inodes actually allocated on the underlying FS (according to df -i information). The max. inodes number is much higher than that, and I do not take it into account. Also, I should probably recheck the results for 1000 files per folder just to make sure. > The 4KB is a conservative estimate considering the fact that though the > arbiter brick does not store data, it still keeps a copy of both user > and gluster xattrs. For example, if the application sets a lot of > xattrs, it can consume a data block if they cannot be accommodated on > the inode itself. Also there is the .glusterfs folder like you said > which would take up some space. Here is what I tried on an XFS brick: 4KB as an upper bound sounds reasonable to me, thanks. But the average value will still be lower, I believe, as it is uncommon for apps to set lots of xattrs, especially in an ordinary deployment. Regards, Oleksandr. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Arbiter brick size estimation
In order to estimate the GlusterFS arbiter brick size, I've deployed a test setup with a replica 3 arbiter 1 volume within one node. Each brick is located on a separate HDD (XFS with inode size == 512). Using GlusterFS v3.7.6 + memleak patches. Volume options are kept default. Here is the script that creates files and folders in the mounted volume: [1] The script creates 1M files of random size (between 1 and 32768 bytes) and some amount of folders. After running it I've got 1036637 folders. So, in total it is 2036637 files and folders. The initial used space on each brick is 42M. After running the script I've got: replica bricks 1 and 2: 19867168 kbytes == 19G arbiter brick: 1872308 kbytes == 1.8G The amount of inodes on each brick is 3139091. So here goes the estimation. Dividing arbiter used space by files+folders we get: (1872308 - 42000)/2036637 == 899 bytes per file or folder Dividing arbiter used space by inodes we get: (1872308 - 42000)/3139091 == 583 bytes per inode Not sure which calculation is correct. I guess we should consider the one that accounts for inodes, because of the .glusterfs/ folder data. Nevertheless, in contrast, documentation [2] says it should be 4096 bytes per file. Am I wrong with my calculations? Pranith? [1] http://termbin.com/ka9x [2] http://gluster.readthedocs.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
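For what it's worth, the same per-inode figure can be derived programmatically from the arbiter brick's filesystem statistics; this is a rough helper sketch (assumed, not part of the script at [1]) that samples statvfs() before and after the file-creation run:
===
/* Rough sketch: snapshot free blocks/inodes on the arbiter brick, wait until
 * the file-creation script has finished, snapshot again and print the average
 * number of bytes consumed per newly allocated inode. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    struct statvfs before, after;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <arbiter-brick-path>\n", argv[0]);
        return 1;
    }

    if (statvfs(argv[1], &before) != 0) { perror("statvfs"); return 1; }
    fprintf(stderr, "run the file-creation script, then press Enter\n");
    getchar();
    if (statvfs(argv[1], &after) != 0) { perror("statvfs"); return 1; }

    unsigned long long used_bytes =
        (unsigned long long)(before.f_bfree - after.f_bfree) * after.f_frsize;
    unsigned long long used_inodes = before.f_ffree - after.f_ffree;

    if (used_inodes)
        printf("%llu bytes per allocated inode on the arbiter brick\n",
               used_bytes / used_inodes);
    return 0;
}
===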
Re: [Gluster-devel] 3.7.8 client is slow
David, could you please cross-post your observations to the following bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1309462 ? It seems you have faced a similar issue. On понеділок, 22 лютого 2016 р. 16:46:01 EET David Robinson wrote: > The 3.7.8 FUSE client is significantly slower than 3.7.6. Is this > related to some of the fixes that were done to correct memory leaks? Is > there anything that I can do to recover the performance of 3.7.6? > > My testing involved creating a "bigfile" that is 20GB. I then installed > the 3.6.6 FUSE client and tested the copy of the bigfile from one > gluster machine to another. The test was repeated 2x to make sure cache > wasn't affecting performance. > > Using CentOS 7.1 > FUSE 3.6.6 took 47 seconds and 38 seconds. > FUSE 3.7.6 took 43 seconds and 34 seconds. > FUSE 3.7.8 took 205 seconds and 224 seconds > > I repeated the test on another machine that is running CentOS 6.7 and > the results were even worse: 98 seconds for FUSE 3.6.6 versus > 575 seconds for FUSE 3.7.8. > > My server setup is: > > Volume Name: gfsbackup > Type: Distribute > Volume ID: 29b8fae9-dfbf-4fa4-9837-8059a310669a > Status: Started > Number of Bricks: 2 > Transport-type: tcp > Bricks: > Brick1: ffib01bkp:/data/brick01/gfsbackup > Brick2: ffib01bkp:/data/brick02/gfsbackup > Options Reconfigured: > performance.readdir-ahead: on > cluster.rebal-throttle: aggressive > diagnostics.client-log-level: WARNING > diagnostics.brick-log-level: WARNING > changelog.changelog: off > client.event-threads: 8 > server.event-threads: 8 > > David > > > > > > > > David F. Robinson, Ph.D. > > President - Corvid Technologies > > 145 Overhill Drive > > Mooresville, NC 28117 > > 704.799.6944 x101 [Office] > > 704.252.1310 [Cell] > > 704.799.7974 [Fax] > > david.robin...@corvidtec.com > > http://www.corvidtec.com ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] GlusterFS v3.7.8 client leaks summary — part II
Hmm, OK. I've rechecked 3.7.8 with the following patches (latest revisions): === Soumya Koduri (3): gfapi: Use inode_forget in case of handle objects inode: Retire the inodes from the lru list in inode_table_destroy rpc: Fix for rpc_transport_t leak === Here is Valgrind output: [1] It seems that all leaks are gone, and that is very nice. Many thanks to all devs. [1] https://gist.github.com/anonymous/eddfdaf3eb7bff458326 16.02.2016 15:30, Soumya Koduri wrote: I have tested using your API app (I/Os done - create,write and stat). I still do not see any inode related leaks. However I posted another fix for rpc_transport object related leak [1]. I request you to re-check if you have the latest patch of [2] applied in your build. [1] http://review.gluster.org/#/c/13456/ [2] http://review.gluster.org/#/c/13125/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] GlusterFS v3.7.8 client leaks summary — part II
And "API" test. I used custom API app [1] and did brief file manipulations through it (create/remove/stat). Then I performed drop_caches, finished API [2] and got the following Valgrind log [3]. I believe there are still some leaks occurring in glfs_lresolve() call chain. Soumya? [1] https://github.com/pfactum/xglfs [2] https://github.com/pfactum/xglfs/blob/master/xglfs_destroy.c#L30 [3] https://gist.github.com/aec72b6164a695cf2d44 11.02.2016 10:12, Oleksandr Natalenko написав: And here goes "rsync" test results (v3.7.8 + two patches by Soumya). 2 volumes involved: source and target. === Common indicators === slabtop before drop_caches: [1] slabtop after drop_caches: [2] === Source volume (less interesting part) === RAM usage before drop_caches: [3] statedump before drop_caches: [4] RAM usage after drop_caches: [5] statedump after drop_caches: [6] === Target volume (most interesting part) === RAM usage before drop_caches: [7] statedump before drop_caches: [8] RAM usage after drop_caches: [9] statedump after drop_caches: [10] Valgrind output: [11] === Conclusion === Again, see no obvious leaks. [1] https://gist.github.com/e72fd30a4198dd630299 [2] https://gist.github.com/78ef9eae3dc16fd79c1b [3] https://gist.github.com/4ed75e8d6cb40a1369d8 [4] https://gist.github.com/20a75d32db76795b90d4 [5] https://gist.github.com/0772959834610dfdaf2d [6] https://gist.github.com/a71684bd3745c77c41eb [7] https://gist.github.com/2c9be083cfe3bffe6cec [8] https://gist.github.com/0102a16c94d3d8eb82e3 [9] https://gist.github.com/23f057dc8e4b2902bba1 [10] https://gist.github.com/385bbb95ca910ec9766f [11] https://gist.github.com/685c4d3e13d31f597722 10.02.2016 15:37, Oleksandr Natalenko написав: Hi, folks. Here go new test results regarding client memory leak. I use v3.7.8 with the following patches: === Soumya Koduri (2): inode: Retire the inodes from the lru list in inode_table_destroy gfapi: Use inode_forget in case of handle objects === Those are the only 2 not merged yet. So far, I've performed only "find" test, and here are the results: RAM usage before drop_caches: [1] statedump before drop_caches: [2] slabtop before drop_caches: [3] RAM usage after drop_caches: [4] statedump after drop_caches: [5] slabtop after drop_caches: [6] Valgrind output: [7] No leaks either via statedump or via valgrind. However, statedump stats still suffer from integer overflow. Next steps I'm going to take: 1) "rsync" test; 2) API test. [1] https://gist.github.com/88d2fa95c28baeb2543f [2] https://gist.github.com/4f3e93ff2db6e3cf4081 [3] https://gist.github.com/62791a2c4258041ba821 [4] https://gist.github.com/1d3ce95a493d054bbac2 [5] https://gist.github.com/fa855a2752d3691365a7 [6] https://gist.github.com/84e9e27d2a2e5ff5dc33 [7] https://gist.github.com/f35bd32a5159d3571d3a ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-users mailing list gluster-us...@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS v3.7.8 client leaks summary — part II
And here goes "rsync" test results (v3.7.8 + two patches by Soumya). 2 volumes involved: source and target. === Common indicators === slabtop before drop_caches: [1] slabtop after drop_caches: [2] === Source volume (less interesting part) === RAM usage before drop_caches: [3] statedump before drop_caches: [4] RAM usage after drop_caches: [5] statedump after drop_caches: [6] === Target volume (most interesting part) === RAM usage before drop_caches: [7] statedump before drop_caches: [8] RAM usage after drop_caches: [9] statedump after drop_caches: [10] Valgrind output: [11] === Conclusion === Again, see no obvious leaks. [1] https://gist.github.com/e72fd30a4198dd630299 [2] https://gist.github.com/78ef9eae3dc16fd79c1b [3] https://gist.github.com/4ed75e8d6cb40a1369d8 [4] https://gist.github.com/20a75d32db76795b90d4 [5] https://gist.github.com/0772959834610dfdaf2d [6] https://gist.github.com/a71684bd3745c77c41eb [7] https://gist.github.com/2c9be083cfe3bffe6cec [8] https://gist.github.com/0102a16c94d3d8eb82e3 [9] https://gist.github.com/23f057dc8e4b2902bba1 [10] https://gist.github.com/385bbb95ca910ec9766f [11] https://gist.github.com/685c4d3e13d31f597722 10.02.2016 15:37, Oleksandr Natalenko написав: Hi, folks. Here go new test results regarding client memory leak. I use v3.7.8 with the following patches: === Soumya Koduri (2): inode: Retire the inodes from the lru list in inode_table_destroy gfapi: Use inode_forget in case of handle objects === Those are the only 2 not merged yet. So far, I've performed only "find" test, and here are the results: RAM usage before drop_caches: [1] statedump before drop_caches: [2] slabtop before drop_caches: [3] RAM usage after drop_caches: [4] statedump after drop_caches: [5] slabtop after drop_caches: [6] Valgrind output: [7] No leaks either via statedump or via valgrind. However, statedump stats still suffer from integer overflow. Next steps I'm going to take: 1) "rsync" test; 2) API test. [1] https://gist.github.com/88d2fa95c28baeb2543f [2] https://gist.github.com/4f3e93ff2db6e3cf4081 [3] https://gist.github.com/62791a2c4258041ba821 [4] https://gist.github.com/1d3ce95a493d054bbac2 [5] https://gist.github.com/fa855a2752d3691365a7 [6] https://gist.github.com/84e9e27d2a2e5ff5dc33 [7] https://gist.github.com/f35bd32a5159d3571d3a ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] GlusterFS v3.7.8 client leaks summary — part II
Hi, folks. Here go new test results regarding client memory leak. I use v3.7.8 with the following patches: === Soumya Koduri (2): inode: Retire the inodes from the lru list in inode_table_destroy gfapi: Use inode_forget in case of handle objects === Those are the only 2 not merged yet. So far, I've performed only "find" test, and here are the results: RAM usage before drop_caches: [1] statedump before drop_caches: [2] slabtop before drop_caches: [3] RAM usage after drop_caches: [4] statedump after drop_caches: [5] slabtop after drop_caches: [6] Valgrind output: [7] No leaks either via statedump or via valgrind. However, statedump stats still suffer from integer overflow. Next steps I'm going to take: 1) "rsync" test; 2) API test. [1] https://gist.github.com/88d2fa95c28baeb2543f [2] https://gist.github.com/4f3e93ff2db6e3cf4081 [3] https://gist.github.com/62791a2c4258041ba821 [4] https://gist.github.com/1d3ce95a493d054bbac2 [5] https://gist.github.com/fa855a2752d3691365a7 [6] https://gist.github.com/84e9e27d2a2e5ff5dc33 [7] https://gist.github.com/f35bd32a5159d3571d3a ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I
Here goes the report on DHT-related leaks patch ("rsync" test). RAM usage before drop_caches: [1] Statedump before drop_caches: [2] RAM usage after drop_caches: [3] Statedump after drop_caches: [4] Statedumps diff: [5] Valgrind output: [6] [1] https://gist.github.com/ca8d56834c14c4bfa98e [2] https://gist.github.com/06dc910d7261750d486c [3] https://gist.github.com/c482b170848a21b6e5f3 [4] https://gist.github.com/ed7f56336b4cbf39f7e8 [5] https://gist.github.com/f8597f34b56d949f7dcb [6] https://gist.github.com/102fc2d2dfa2d2d179fa I guess, the patch works. 29.01.2016 23:11, Vijay Bellur написав: On 01/29/2016 01:09 PM, Oleksandr Natalenko wrote: Here is intermediate summary of current memory leaks in FUSE client investigation. I use GlusterFS v3.7.6 release with the following patches: === Kaleb S KEITHLEY (1): fuse: use-after-free fix in fuse-bridge, revisited Pranith Kumar K (1): mount/fuse: Fix use-after-free crash Soumya Koduri (3): gfapi: Fix inode nlookup counts inode: Retire the inodes from the lru list in inode_table_destroy upcall: free the xdr* allocations === With those patches we got API leaks fixed (I hope, brief tests show that) and got rid of "kernel notifier loop terminated" message. Nevertheless, FUSE client still leaks. I have several test volumes with several million of small files (100K…2M in average). I do 2 types of FUSE client testing: 1) find /mnt/volume -type d 2) rsync -av -H /mnt/source_volume/* /mnt/target_volume/ And most up-to-date results are shown below: === find /mnt/volume -type d === Memory consumption: ~4G Statedump: https://gist.github.com/10cde83c63f1b4f1dd7a Valgrind: https://gist.github.com/097afb01ebb2c5e9e78d I guess, fuse-bridge/fuse-resolve. related. === rsync -av -H /mnt/source_volume/* /mnt/target_volume/ === Memory consumption: ~3.3...4G Statedump (target volume): https://gist.github.com/31e43110eaa4da663435 Valgrind (target volume): https://gist.github.com/f8e0151a6878cacc9b1a I guess, DHT-related. Give me more patches to test :). Thank you as ever for your detailed reports! This patch should help the dht leaks observed as part of dht_do_rename() in valgrind logs of target volume. http://review.gluster.org/#/c/13322/ Can you please verify if this indeed helps? Regards, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I
02.02.2016 10:07, Xavier Hernandez написав: Could it be memory used by Valgrind itself to track glusterfs' memory usage ? Could you repeat the test without Valgrind and see if the memory usage after dropping caches returns to low values ? Yup. Here are the results: === pf@server:~ » ps aux | grep volume root 19412 14.4 10.0 5416964 4971692 ? Ssl 10:15 36:32 /usr/sbin/glusterfs --volfile-server=server.example.com --volfile-id=volume /mnt/volume pf@server:~ » echo 2 | sudo tee /proc/sys/vm/drop_caches 2 pf@server:~ » ps aux | grep volume root 19412 13.6 3.5 2336772 1740804 ? Ssl 10:15 36:53 /usr/sbin/glusterfs --volfile-server=server.example.com --volfile-id=volume /mnt/volume === Dropped from 4.9G to 1.7G. But fresh mount consumes only 25M (megabytes): === root 23347 0.7 0.0 698376 25124 ?Ssl 14:49 0:00 /usr/sbin/glusterfs --volfile-server=server.example.com --volfile-id=volume /mnt/volume === Why? Examining statedump shows only the following snippet with high "size" value: === [mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage] size=4234592647 num_allocs=1 max_size=4294935223 max_num_allocs=3 total_allocs=4186991 === Another leak? Grepping "gf_fuse_mt_iov_base" on GlusterFS source tree shows the following: === $ grep -Rn gf_fuse_mt_iov_base xlators/mount/fuse/src/fuse-mem-types.h:20: gf_fuse_mt_iov_base, xlators/mount/fuse/src/fuse-bridge.c:4887: gf_fuse_mt_iov_base); === fuse-bridge.c snippet: === /* Add extra 128 byte to the first iov so that it can * accommodate "ordinary" non-write requests. It's not * guaranteed to be big enough, as SETXATTR and namespace * operations with very long names may grow behind it, * but it's good enough in most cases (and we can handle * rest via realloc). */ iov_in[0].iov_base = GF_CALLOC (1, msg0_size, gf_fuse_mt_iov_base); === Probably, some freeing missing for iov_base? This is not a real memory leak. It's only a bad accounting of memory. Note that num_allocs is 1. If you look at libglusterfs/src/mem-pool.c, you will see this: /* TBD: it would be nice to adjust the memory accounting info here, * but calling gf_mem_set_acct_info here is wrong because it bumps * up counts as though this is a new allocation - which it's not. * The consequence of doing nothing here is only that the sizes will be * wrong, but at least the counts won't be. uint32_t type = 0; xlator_t *xl = NULL; type = header->type; xl = (xlator_t *) header->xlator; gf_mem_set_acct_info (xl, &new_ptr, size, type, NULL); */ This means that memory reallocs are not correctly accounted, so the tracked size is incorrect (note that fuse_thread_proc() calls GF_REALLOC() in some cases). There are two problems here: 1. The memory is allocated with a given size S1, then reallocated with a size S2 (S2 > S1), but not accounted, so the memory accounting system still thinks that the allocated size is S1. When memory is freed, S2 is substracted from the total size used. With enough allocs/reallocs/frees, this value becomes negative. 2. statedump shows the 64-bit 'size' field representing the total memory used by a given type as an unsigned 32-bit value, loosing some information. Xavi [1] https://gist.github.com/f0cf98e8bff0c13ea38f [2] https://gist.github.com/87baa0a778ba54f0f7f7 [3] https://gist.github.com/7013b493d19c8c5fffae [4] https://gist.github.com/cc38155b57e68d7e86d5 [5] https://gist.github.com/6a24000c77760a97976a [6] https://gist.github.com/74bd7a9f734c2fd21c33 On понеділок, 1 лютого 2016 р. 14:24:22 EET Soumya Koduri wrote: On 02/01/2016 01:39 PM, Oleksandr Natalenko wrote: Wait. 
It seems to be my bad. Before unmounting I do drop_caches (2), and glusterfs process CPU usage goes to 100% for a while. I haven't waited for it to drop to 0%, and instead perform unmount. It seems glusterfs is purging inodes and that's why it uses 100% of CPU. I've re-tested it, waiting for CPU usage to become normal, and got no leaks. Will verify this once again and report more. BTW, if that works, how could I limit inode cache for FUSE client? I do not want it to go beyond 1G, for example, even if I have 48G of RAM on my server. Its hard-coded for now. For fuse the lru limit (of the inodes which are not active) is (32*1024). One of the ways to address this (which we were discussing earlier) is to have an option to configure inode cache limit. If that sounds good, we can then check on if it has to be global/volume-level, client/server/both. Thanks, Soumya 01.02.2016 09:54, Soumya Koduri написав: On 01/31/2016 03:05 PM, Oleksandr Natalenko wrote: Unfortunately, this patch doesn't h
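The effect Xavi describes can be modelled in a few lines; the block below is only a toy model with made-up sizes (not the real libglusterfs accounting code), but it shows why an unaccounted realloc ends up as a value close to 2^32 in the statedump:
===
/* Toy model of the accounting gap: the allocation records s1, the realloc to
 * s2 is never re-recorded, and the free subtracts the real size s2, so the
 * per-type counter drops "below zero" and prints as a huge number when shown
 * as an unsigned 32-bit value. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t type_size = 0;        /* per-type "size" field in the statedump   */
    size_t   s1 = 4096;            /* made-up size recorded at GF_CALLOC time  */
    size_t   s2 = 128 * 1024;      /* made-up size after GF_REALLOC            */

    type_size += s1;               /* the allocation is accounted              */
                                   /* ...GF_REALLOC grows it, nothing recorded */
    type_size -= s2;               /* the free subtracts the real, bigger size */

    printf("64-bit counter    : %" PRIu64 "\n", type_size);
    printf("shown as uint32_t : %" PRIu32 "\n", (uint32_t)type_size);
    return 0;
}
===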
Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I
Please take a look at updated test results. Test: find /mnt/volume -type d RAM usage after "find" finishes: ~ 10.8G (see "ps" output [1]). Statedump after "find" finishes: [2]. Then I did drop_caches, and RAM usage dropped to ~4.7G [3]. Statedump after drop_caches: [4]. Here is diff between statedumps: [5]. And, finally, Valgrind output: [6]. Definitely, no major leaks on exit, but why glusterfs process uses almost 5G of RAM after drop_caches? Examining statedump shows only the following snippet with high "size" value: === [mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage] size=4234592647 num_allocs=1 max_size=4294935223 max_num_allocs=3 total_allocs=4186991 === Another leak? Grepping "gf_fuse_mt_iov_base" on GlusterFS source tree shows the following: === $ grep -Rn gf_fuse_mt_iov_base xlators/mount/fuse/src/fuse-mem-types.h:20:gf_fuse_mt_iov_base, xlators/mount/fuse/src/fuse-bridge.c:4887: gf_fuse_mt_iov_base); === fuse-bridge.c snippet: === /* Add extra 128 byte to the first iov so that it can * accommodate "ordinary" non-write requests. It's not * guaranteed to be big enough, as SETXATTR and namespace * operations with very long names may grow behind it, * but it's good enough in most cases (and we can handle * rest via realloc). */ iov_in[0].iov_base = GF_CALLOC (1, msg0_size, gf_fuse_mt_iov_base); === Probably, some freeing missing for iov_base? [1] https://gist.github.com/f0cf98e8bff0c13ea38f [2] https://gist.github.com/87baa0a778ba54f0f7f7 [3] https://gist.github.com/7013b493d19c8c5fffae [4] https://gist.github.com/cc38155b57e68d7e86d5 [5] https://gist.github.com/6a24000c77760a97976a [6] https://gist.github.com/74bd7a9f734c2fd21c33 On понеділок, 1 лютого 2016 р. 14:24:22 EET Soumya Koduri wrote: > On 02/01/2016 01:39 PM, Oleksandr Natalenko wrote: > > Wait. It seems to be my bad. > > > > Before unmounting I do drop_caches (2), and glusterfs process CPU usage > > goes to 100% for a while. I haven't waited for it to drop to 0%, and > > instead perform unmount. It seems glusterfs is purging inodes and that's > > why it uses 100% of CPU. I've re-tested it, waiting for CPU usage to > > become normal, and got no leaks. > > > > Will verify this once again and report more. > > > > BTW, if that works, how could I limit inode cache for FUSE client? I do > > not want it to go beyond 1G, for example, even if I have 48G of RAM on > > my server. > > Its hard-coded for now. For fuse the lru limit (of the inodes which are > not active) is (32*1024). > One of the ways to address this (which we were discussing earlier) is to > have an option to configure inode cache limit. If that sounds good, we > can then check on if it has to be global/volume-level, client/server/both. > > Thanks, > Soumya > > > 01.02.2016 09:54, Soumya Koduri написав: > >> On 01/31/2016 03:05 PM, Oleksandr Natalenko wrote: > >>> Unfortunately, this patch doesn't help. > >>> > >>> RAM usage on "find" finish is ~9G. > >>> > >>> Here is statedump before drop_caches: https://gist.github.com/ > >>> fc1647de0982ab447e20 > >> > >> [mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage] > >> size=706766688 > >> num_allocs=2454051 > >> > >>> And after drop_caches: https://gist.github.com/5eab63bc13f78787ed19 > >> > >> [mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage] > >> size=550996416 > >> num_allocs=1913182 > >> > >> There isn't much significant drop in inode contexts. 
One of the > >> reasons could be because of dentrys holding a refcount on the inodes > >> which shall result in inodes not getting purged even after > >> fuse_forget. > >> > >> > >> pool-name=fuse:dentry_t > >> hot-count=32761 > >> > >> if '32761' is the current active dentry count, it still doesn't seem > >> to match up to inode count. > >> > >> Thanks, > >> Soumya > >> > >>> And here is Valgrind output: > >>> https://gist.github.com/2490aeac448320d98596 > >>> > >>> On субота, 30 січня 2016 р. 22:56:37 EET Xavier Hernandez wrote: > >>>> There's another inode leak caused by an incorrect counting of > >>>> lookups on directory reads. > >&g
Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I
Wait. It seems to be my bad. Before unmounting I do drop_caches (2), and glusterfs process CPU usage goes to 100% for a while. I haven't waited for it to drop to 0%, and instead perform unmount. It seems glusterfs is purging inodes and that's why it uses 100% of CPU. I've re-tested it, waiting for CPU usage to become normal, and got no leaks. Will verify this once again and report more. BTW, if that works, how could I limit inode cache for FUSE client? I do not want it to go beyond 1G, for example, even if I have 48G of RAM on my server. 01.02.2016 09:54, Soumya Koduri написав: On 01/31/2016 03:05 PM, Oleksandr Natalenko wrote: Unfortunately, this patch doesn't help. RAM usage on "find" finish is ~9G. Here is statedump before drop_caches: https://gist.github.com/ fc1647de0982ab447e20 [mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage] size=706766688 num_allocs=2454051 And after drop_caches: https://gist.github.com/5eab63bc13f78787ed19 [mount/fuse.fuse - usage-type gf_common_mt_inode_ctx memusage] size=550996416 num_allocs=1913182 There isn't much significant drop in inode contexts. One of the reasons could be because of dentrys holding a refcount on the inodes which shall result in inodes not getting purged even after fuse_forget. pool-name=fuse:dentry_t hot-count=32761 if '32761' is the current active dentry count, it still doesn't seem to match up to inode count. Thanks, Soumya And here is Valgrind output: https://gist.github.com/2490aeac448320d98596 On субота, 30 січня 2016 р. 22:56:37 EET Xavier Hernandez wrote: There's another inode leak caused by an incorrect counting of lookups on directory reads. Here's a patch that solves the problem for 3.7: http://review.gluster.org/13324 Hopefully with this patch the memory leaks should disapear. Xavi On 29.01.2016 19:09, Oleksandr Natalenko wrote: Here is intermediate summary of current memory leaks in FUSE client investigation. I use GlusterFS v3.7.6 release with the following patches: === Kaleb S KEITHLEY (1): fuse: use-after-free fix in fuse-bridge, revisited Pranith Kumar K (1): mount/fuse: Fix use-after-free crash Soumya Koduri (3): gfapi: Fix inode nlookup counts inode: Retire the inodes from the lru list in inode_table_destroy upcall: free the xdr* allocations === With those patches we got API leaks fixed (I hope, brief tests show that) and got rid of "kernel notifier loop terminated" message. Nevertheless, FUSE client still leaks. I have several test volumes with several million of small files (100K…2M in average). I do 2 types of FUSE client testing: 1) find /mnt/volume -type d 2) rsync -av -H /mnt/source_volume/* /mnt/target_volume/ And most up-to-date results are shown below: === find /mnt/volume -type d === Memory consumption: ~4G Statedump: https://gist.github.com/10cde83c63f1b4f1dd7a Valgrind: https://gist.github.com/097afb01ebb2c5e9e78d I guess, fuse-bridge/fuse-resolve. related. === rsync -av -H /mnt/source_volume/* /mnt/target_volume/ === Memory consumption: ~3.3...4G Statedump (target volume): https://gist.github.com/31e43110eaa4da663435 Valgrind (target volume): https://gist.github.com/f8e0151a6878cacc9b1a I guess, DHT-related. Give me more patches to test :). ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I
Unfortunately, this patch doesn't help. RAM usage on "find" finish is ~9G. Here is statedump before drop_caches: https://gist.github.com/ fc1647de0982ab447e20 And after drop_caches: https://gist.github.com/5eab63bc13f78787ed19 And here is Valgrind output: https://gist.github.com/2490aeac448320d98596 On субота, 30 січня 2016 р. 22:56:37 EET Xavier Hernandez wrote: > There's another inode leak caused by an incorrect counting of > lookups on directory reads. > > Here's a patch that solves the problem for > 3.7: > > http://review.gluster.org/13324 > > Hopefully with this patch the > memory leaks should disapear. > > Xavi > > On 29.01.2016 19:09, Oleksandr > > Natalenko wrote: > > Here is intermediate summary of current memory > > leaks in FUSE client > > > investigation. > > > > I use GlusterFS v3.7.6 > > release with the following patches: > > === > > > Kaleb S KEITHLEY (1): > fuse: use-after-free fix in fuse-bridge, revisited > > > Pranith Kumar K > > (1): > > mount/fuse: Fix use-after-free crash > > > Soumya Koduri (3): > gfapi: Fix inode nlookup counts > > > inode: Retire the inodes from the lru > > list in inode_table_destroy > > > upcall: free the xdr* allocations > > === > > > > > > With those patches we got API leaks fixed (I hope, brief tests show > > that) and > > > got rid of "kernel notifier loop terminated" message. > > Nevertheless, FUSE > > > client still leaks. > > > > I have several test > > volumes with several million of small files (100K…2M in > > > average). I > > do 2 types of FUSE client testing: > > 1) find /mnt/volume -type d > > 2) > > rsync -av -H /mnt/source_volume/* /mnt/target_volume/ > > > And most > > up-to-date results are shown below: > > === find /mnt/volume -type d > > === > > > Memory consumption: ~4G > > > Statedump: > https://gist.github.com/10cde83c63f1b4f1dd7a > > > Valgrind: > https://gist.github.com/097afb01ebb2c5e9e78d > > > I guess, > > fuse-bridge/fuse-resolve. related. > > > === rsync -av -H > > /mnt/source_volume/* /mnt/target_volume/ === > > > Memory consumption: > ~3.3...4G > > > Statedump (target volume): > https://gist.github.com/31e43110eaa4da663435 > > > Valgrind (target volume): > https://gist.github.com/f8e0151a6878cacc9b1a > > > I guess, > > DHT-related. > > > Give me more patches to test :). > > ___ > > > Gluster-devel mailing > > list > > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] GlusterFS FUSE client leaks summary — part I
Here is intermediate summary of current memory leaks in FUSE client investigation. I use GlusterFS v3.7.6 release with the following patches: === Kaleb S KEITHLEY (1): fuse: use-after-free fix in fuse-bridge, revisited Pranith Kumar K (1): mount/fuse: Fix use-after-free crash Soumya Koduri (3): gfapi: Fix inode nlookup counts inode: Retire the inodes from the lru list in inode_table_destroy upcall: free the xdr* allocations === With those patches we got API leaks fixed (I hope, brief tests show that) and got rid of "kernel notifier loop terminated" message. Nevertheless, FUSE client still leaks. I have several test volumes with several million of small files (100K…2M in average). I do 2 types of FUSE client testing: 1) find /mnt/volume -type d 2) rsync -av -H /mnt/source_volume/* /mnt/target_volume/ And most up-to-date results are shown below: === find /mnt/volume -type d === Memory consumption: ~4G Statedump: https://gist.github.com/10cde83c63f1b4f1dd7a Valgrind: https://gist.github.com/097afb01ebb2c5e9e78d I guess, fuse-bridge/fuse-resolve. related. === rsync -av -H /mnt/source_volume/* /mnt/target_volume/ === Memory consumption: ~3.3...4G Statedump (target volume): https://gist.github.com/31e43110eaa4da663435 Valgrind (target volume): https://gist.github.com/f8e0151a6878cacc9b1a I guess, DHT-related. Give me more patches to test :). ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
OK, given GlusterFS v3.7.6 with the following patches: === Kaleb S KEITHLEY (1): fuse: use-after-free fix in fuse-bridge, revisited Pranith Kumar K (1): mount/fuse: Fix use-after-free crash Soumya Koduri (3): gfapi: Fix inode nlookup counts inode: Retire the inodes from the lru list in inode_table_destroy upcall: free the xdr* allocations === I've repeated "rsync" test under Valgrind, and here is Valgrind output: https://gist.github.com/f8e0151a6878cacc9b1a I see DHT-related leaks. On понеділок, 25 січня 2016 р. 02:46:32 EET Oleksandr Natalenko wrote: > Also, I've repeated the same "find" test again, but with glusterfs process > launched under valgrind. And here is valgrind output: > > https://gist.github.com/097afb01ebb2c5e9e78d > > On неділя, 24 січня 2016 р. 09:33:00 EET Mathieu Chateau wrote: > > Thanks for all your tests and times, it looks promising :) > > > > > > Cordialement, > > Mathieu CHATEAU > > http://www.lotp.fr > > > > 2016-01-23 22:30 GMT+01:00 Oleksandr Natalenko : > > > OK, now I'm re-performing tests with rsync + GlusterFS v3.7.6 + the > > > following > > > patches: > > > > > > === > > > > > > Kaleb S KEITHLEY (1): > > > fuse: use-after-free fix in fuse-bridge, revisited > > > > > > Pranith Kumar K (1): > > > mount/fuse: Fix use-after-free crash > > > > > > Soumya Koduri (3): > > > gfapi: Fix inode nlookup counts > > > inode: Retire the inodes from the lru list in inode_table_destroy > > > upcall: free the xdr* allocations > > > > > > === > > > > > > I run rsync from one GlusterFS volume to another. While memory started > > > from > > > under 100 MiBs, it stalled at around 600 MiBs for source volume and does > > > not > > > grow further. As for target volume it is ~730 MiBs, and that is why I'm > > > going > > > to do several rsync rounds to see if it grows more (with no patches bare > > > 3.7.6 > > > could consume more than 20 GiBs). > > > > > > No "kernel notifier loop terminated" message so far for both volumes. > > > > > > Will report more in several days. I hope current patches will be > > > incorporated > > > into 3.7.7. > > > > > > On пʼятниця, 22 січня 2016 р. 12:53:36 EET Kaleb S. KEITHLEY wrote: > > > > On 01/22/2016 12:43 PM, Oleksandr Natalenko wrote: > > > > > On пʼятниця, 22 січня 2016 р. 12:32:01 EET Kaleb S. KEITHLEY wrote: > > > > >> I presume by this you mean you're not seeing the "kernel notifier > > > > >> loop > > > > >> terminated" error in your logs. > > > > > > > > > > Correct, but only with simple traversing. Have to test under rsync. > > > > > > > > Without the patch I'd get "kernel notifier loop terminated" within a > > > > few > > > > minutes of starting I/O. With the patch I haven't seen it in 24 hours > > > > of beating on it. > > > > > > > > >> Hmmm. My system is not leaking. Last 24 hours the RSZ and VSZ are > > > > > > > >> stable: > > > http://download.gluster.org/pub/gluster/glusterfs/dynamic-analysis/longe > > > v > > > > > > > >> ity /client.out > > > > > > > > > > What ops do you perform on mounted volume? Read, write, stat? Is > > > > > that > > > > > 3.7.6 + patches? > > > > > > > > I'm running an internally developed I/O load generator written by a > > > > guy > > > > on our perf team. > > > > > > > > it does, create, write, read, rename, stat, delete, and more. > > ___ > Gluster-users mailing list > gluster-us...@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Here are the results of "rsync" test. I've got 2 volumes — source and target — performing multiple files rsyncing from one volume to another. Source volume: === root 22259 3.5 1.5 1204200 771004 ? Ssl Jan23 109:42 /usr/sbin/ glusterfs --volfile-server=glusterfs.example.com --volfile-id=source /mnt/net/ glusterfs/source === One may see that memory consumption of source volume is not that high as with "find" test. Here is source volume client statedump: https://gist.github.com/ ef5b798859219e739aeb Here is source volume info: https://gist.github.com/3d2f32e7346df9333004 Target volume: === root 22200 23.8 6.9 3983676 3456252 ? Ssl Jan23 734:57 /usr/sbin/ glusterfs --volfile-server=glusterfs.example.com --volfile-id=target /mnt/net/ glusterfs/target === Here is target volume info: https://gist.github.com/c9de01168071575b109e Target volume RAM consumption is very high (more than 3 GiBs). Here is client statedump too: https://gist.github.com/31e43110eaa4da663435 I see huge DHT-related memory usage, e.g.: === [cluster/distribute.asterisk_records-dht - usage-type gf_common_mt_mem_pool memusage] size=725575592 num_allocs=7552486 max_size=725575836 max_num_allocs=7552489 total_allocs=90843958 [cluster/distribute.asterisk_records-dht - usage-type gf_common_mt_char memusage] size=586404954 num_allocs=7572836 max_size=586405157 max_num_allocs=7572839 total_allocs=80463096 === Ideas? On понеділок, 25 січня 2016 р. 02:46:32 EET Oleksandr Natalenko wrote: > Also, I've repeated the same "find" test again, but with glusterfs process > launched under valgrind. And here is valgrind output: > > https://gist.github.com/097afb01ebb2c5e9e78d > > On неділя, 24 січня 2016 р. 09:33:00 EET Mathieu Chateau wrote: > > Thanks for all your tests and times, it looks promising :) > > > > > > Cordialement, > > Mathieu CHATEAU > > http://www.lotp.fr > > > > 2016-01-23 22:30 GMT+01:00 Oleksandr Natalenko : > > > OK, now I'm re-performing tests with rsync + GlusterFS v3.7.6 + the > > > following > > > patches: > > > > > > === > > > > > > Kaleb S KEITHLEY (1): > > > fuse: use-after-free fix in fuse-bridge, revisited > > > > > > Pranith Kumar K (1): > > > mount/fuse: Fix use-after-free crash > > > > > > Soumya Koduri (3): > > > gfapi: Fix inode nlookup counts > > > inode: Retire the inodes from the lru list in inode_table_destroy > > > upcall: free the xdr* allocations > > > > > > === > > > > > > I run rsync from one GlusterFS volume to another. While memory started > > > from > > > under 100 MiBs, it stalled at around 600 MiBs for source volume and does > > > not > > > grow further. As for target volume it is ~730 MiBs, and that is why I'm > > > going > > > to do several rsync rounds to see if it grows more (with no patches bare > > > 3.7.6 > > > could consume more than 20 GiBs). > > > > > > No "kernel notifier loop terminated" message so far for both volumes. > > > > > > Will report more in several days. I hope current patches will be > > > incorporated > > > into 3.7.7. > > > > > > On пʼятниця, 22 січня 2016 р. 12:53:36 EET Kaleb S. KEITHLEY wrote: > > > > On 01/22/2016 12:43 PM, Oleksandr Natalenko wrote: > > > > > On пʼятниця, 22 січня 2016 р. 12:32:01 EET Kaleb S. KEITHLEY wrote: > > > > >> I presume by this you mean you're not seeing the "kernel notifier > > > > >> loop > > > > >> terminated" error in your logs. > > > > > > > > > > Correct, but only with simple traversing. Have to test under rsync. 
> > > > > > > > Without the patch I'd get "kernel notifier loop terminated" within a > > > > few > > > > minutes of starting I/O. With the patch I haven't seen it in 24 hours > > > > of beating on it. > > > > > > > > >> Hmmm. My system is not leaking. Last 24 hours the RSZ and VSZ are > > > > > > > >> stable: > > > http://download.gluster.org/pub/gluster/glusterfs/dynamic-analysis/longe > > > v > > > > > > > >> ity /client.out > > > > > > > > > > What ops do you perform on mounted volume? Read, write, stat? Is > > > > > that > > > > > 3.7.6 + patches? > > > > > > > > I'm running an internally developed I/O load generator written by a > > > > guy > > > > on our perf team. > > > > > > > > it does, create, write, read, rename, stat, delete, and more. > > ___ > Gluster-users mailing list > gluster-us...@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
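For what it's worth, simple division of the dht figures above gives 725575592 / 7552486 ≈ 96 bytes per live allocation for the gf_common_mt_mem_pool type and 586404954 / 7572836 ≈ 77 bytes for gf_common_mt_char — in other words, roughly 7.5 million very small allocations held at the same time by the target volume's client.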
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Also, I've repeated the same "find" test again, but with glusterfs process launched under valgrind. And here is valgrind output: https://gist.github.com/097afb01ebb2c5e9e78d On неділя, 24 січня 2016 р. 09:33:00 EET Mathieu Chateau wrote: > Thanks for all your tests and times, it looks promising :) > > > Cordialement, > Mathieu CHATEAU > http://www.lotp.fr > > 2016-01-23 22:30 GMT+01:00 Oleksandr Natalenko : > > OK, now I'm re-performing tests with rsync + GlusterFS v3.7.6 + the > > following > > patches: > > > > === > > > > Kaleb S KEITHLEY (1): > > fuse: use-after-free fix in fuse-bridge, revisited > > > > Pranith Kumar K (1): > > mount/fuse: Fix use-after-free crash > > > > Soumya Koduri (3): > > gfapi: Fix inode nlookup counts > > inode: Retire the inodes from the lru list in inode_table_destroy > > upcall: free the xdr* allocations > > > > === > > > > I run rsync from one GlusterFS volume to another. While memory started > > from > > under 100 MiBs, it stalled at around 600 MiBs for source volume and does > > not > > grow further. As for target volume it is ~730 MiBs, and that is why I'm > > going > > to do several rsync rounds to see if it grows more (with no patches bare > > 3.7.6 > > could consume more than 20 GiBs). > > > > No "kernel notifier loop terminated" message so far for both volumes. > > > > Will report more in several days. I hope current patches will be > > incorporated > > into 3.7.7. > > > > On пʼятниця, 22 січня 2016 р. 12:53:36 EET Kaleb S. KEITHLEY wrote: > > > On 01/22/2016 12:43 PM, Oleksandr Natalenko wrote: > > > > On пʼятниця, 22 січня 2016 р. 12:32:01 EET Kaleb S. KEITHLEY wrote: > > > >> I presume by this you mean you're not seeing the "kernel notifier > > > >> loop > > > >> terminated" error in your logs. > > > > > > > > Correct, but only with simple traversing. Have to test under rsync. > > > > > > Without the patch I'd get "kernel notifier loop terminated" within a few > > > minutes of starting I/O. With the patch I haven't seen it in 24 hours > > > of beating on it. > > > > > > >> Hmmm. My system is not leaking. Last 24 hours the RSZ and VSZ are > > > > > >> stable: > > http://download.gluster.org/pub/gluster/glusterfs/dynamic-analysis/longev > > > > > >> ity /client.out > > > > > > > > What ops do you perform on mounted volume? Read, write, stat? Is that > > > > 3.7.6 + patches? > > > > > > I'm running an internally developed I/O load generator written by a guy > > > on our perf team. > > > > > > it does, create, write, read, rename, stat, delete, and more. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
BTW, am I the only one who sees in max_size=4294965480 almost 2^32? Could that be integer overflow? On неділя, 24 січня 2016 р. 13:23:55 EET Oleksandr Natalenko wrote: > The leak definitely remains. I did "find /mnt/volume -type d" over GlusterFS > volume, with mentioned patches applied and without "kernel notifier loop > terminated" message, but "glusterfs" process consumed ~4GiB of RAM after > "find" finished. > > Here is statedump: > > https://gist.github.com/10cde83c63f1b4f1dd7a > > I see the following: > > === > [mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage] > size=4235109959 > num_allocs=2 > max_size=4294965480 > max_num_allocs=3 > total_allocs=4533524 > === > > ~4GiB, right? > > Pranith, Kaleb? > > On неділя, 24 січня 2016 р. 09:33:00 EET Mathieu Chateau wrote: > > Thanks for all your tests and times, it looks promising :) > > > > > > Cordialement, > > Mathieu CHATEAU > > http://www.lotp.fr > > > > 2016-01-23 22:30 GMT+01:00 Oleksandr Natalenko : > > > OK, now I'm re-performing tests with rsync + GlusterFS v3.7.6 + the > > > following > > > patches: > > > > > > === > > > > > > Kaleb S KEITHLEY (1): > > > fuse: use-after-free fix in fuse-bridge, revisited > > > > > > Pranith Kumar K (1): > > > mount/fuse: Fix use-after-free crash > > > > > > Soumya Koduri (3): > > > gfapi: Fix inode nlookup counts > > > inode: Retire the inodes from the lru list in inode_table_destroy > > > upcall: free the xdr* allocations > > > > > > === > > > > > > I run rsync from one GlusterFS volume to another. While memory started > > > from > > > under 100 MiBs, it stalled at around 600 MiBs for source volume and does > > > not > > > grow further. As for target volume it is ~730 MiBs, and that is why I'm > > > going > > > to do several rsync rounds to see if it grows more (with no patches bare > > > 3.7.6 > > > could consume more than 20 GiBs). > > > > > > No "kernel notifier loop terminated" message so far for both volumes. > > > > > > Will report more in several days. I hope current patches will be > > > incorporated > > > into 3.7.7. > > > > > > On пʼятниця, 22 січня 2016 р. 12:53:36 EET Kaleb S. KEITHLEY wrote: > > > > On 01/22/2016 12:43 PM, Oleksandr Natalenko wrote: > > > > > On пʼятниця, 22 січня 2016 р. 12:32:01 EET Kaleb S. KEITHLEY wrote: > > > > >> I presume by this you mean you're not seeing the "kernel notifier > > > > >> loop > > > > >> terminated" error in your logs. > > > > > > > > > > Correct, but only with simple traversing. Have to test under rsync. > > > > > > > > Without the patch I'd get "kernel notifier loop terminated" within a > > > > few > > > > minutes of starting I/O. With the patch I haven't seen it in 24 hours > > > > of beating on it. > > > > > > > > >> Hmmm. My system is not leaking. Last 24 hours the RSZ and VSZ are > > > > > > > >> stable: > > > http://download.gluster.org/pub/gluster/glusterfs/dynamic-analysis/longe > > > v > > > > > > > >> ity /client.out > > > > > > > > > > What ops do you perform on mounted volume? Read, write, stat? Is > > > > > that > > > > > 3.7.6 + patches? > > > > > > > > I'm running an internally developed I/O load generator written by a > > > > guy > > > > on our perf team. > > > > > > > > it does, create, write, read, rename, stat, delete, and more. > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
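For reference, 2^32 = 4294967296, and 4294967296 − 4294965480 = 1816, so max_size here is exactly what a small negative value (−1816) looks like when printed as an unsigned 32-bit integer; most likely the counter under-ran zero rather than genuinely peaking at ~4 GiB.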
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
The leak definitely remains. I did "find /mnt/volume -type d" over GlusterFS volume, with mentioned patches applied and without "kernel notifier loop terminated" message, but "glusterfs" process consumed ~4GiB of RAM after "find" finished. Here is statedump: https://gist.github.com/10cde83c63f1b4f1dd7a I see the following: === [mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage] size=4235109959 num_allocs=2 max_size=4294965480 max_num_allocs=3 total_allocs=4533524 === ~4GiB, right? Pranith, Kaleb? On неділя, 24 січня 2016 р. 09:33:00 EET Mathieu Chateau wrote: > Thanks for all your tests and times, it looks promising :) > > > Cordialement, > Mathieu CHATEAU > http://www.lotp.fr > > 2016-01-23 22:30 GMT+01:00 Oleksandr Natalenko : > > OK, now I'm re-performing tests with rsync + GlusterFS v3.7.6 + the > > following > > patches: > > > > === > > > > Kaleb S KEITHLEY (1): > > fuse: use-after-free fix in fuse-bridge, revisited > > > > Pranith Kumar K (1): > > mount/fuse: Fix use-after-free crash > > > > Soumya Koduri (3): > > gfapi: Fix inode nlookup counts > > inode: Retire the inodes from the lru list in inode_table_destroy > > upcall: free the xdr* allocations > > > > === > > > > I run rsync from one GlusterFS volume to another. While memory started > > from > > under 100 MiBs, it stalled at around 600 MiBs for source volume and does > > not > > grow further. As for target volume it is ~730 MiBs, and that is why I'm > > going > > to do several rsync rounds to see if it grows more (with no patches bare > > 3.7.6 > > could consume more than 20 GiBs). > > > > No "kernel notifier loop terminated" message so far for both volumes. > > > > Will report more in several days. I hope current patches will be > > incorporated > > into 3.7.7. > > > > On пʼятниця, 22 січня 2016 р. 12:53:36 EET Kaleb S. KEITHLEY wrote: > > > On 01/22/2016 12:43 PM, Oleksandr Natalenko wrote: > > > > On пʼятниця, 22 січня 2016 р. 12:32:01 EET Kaleb S. KEITHLEY wrote: > > > >> I presume by this you mean you're not seeing the "kernel notifier > > > >> loop > > > >> terminated" error in your logs. > > > > > > > > Correct, but only with simple traversing. Have to test under rsync. > > > > > > Without the patch I'd get "kernel notifier loop terminated" within a few > > > minutes of starting I/O. With the patch I haven't seen it in 24 hours > > > of beating on it. > > > > > > >> Hmmm. My system is not leaking. Last 24 hours the RSZ and VSZ are > > > > > >> stable: > > http://download.gluster.org/pub/gluster/glusterfs/dynamic-analysis/longev > > > > > >> ity /client.out > > > > > > > > What ops do you perform on mounted volume? Read, write, stat? Is that > > > > 3.7.6 + patches? > > > > > > I'm running an internally developed I/O load generator written by a guy > > > on our perf team. > > > > > > it does, create, write, read, rename, stat, delete, and more. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
With "performance.client-io-threads" set to "off" no hangs occurred in 3 rsync/rm rounds. Could that be some fuse-bridge lock race? Will bring that option to "on" back again and try to get full statedump. On четвер, 21 січня 2016 р. 14:54:47 EET Raghavendra G wrote: > On Thu, Jan 21, 2016 at 10:49 AM, Pranith Kumar Karampuri < > > pkara...@redhat.com> wrote: > > On 01/18/2016 02:28 PM, Oleksandr Natalenko wrote: > >> XFS. Server side works OK, I'm able to mount volume again. Brick is 30% > >> full. > > > > Oleksandr, > > > > Will it be possible to get the statedump of the client, bricks > > > > output next time it happens? > > > > https://github.com/gluster/glusterfs/blob/master/doc/debugging/statedump.m > > d#how-to-generate-statedump > We also need to dump inode information. To do that you've to add "all=yes" > to /var/run/gluster/glusterdump.options before you issue commands to get > statedump. > > > Pranith > > > >> On понеділок, 18 січня 2016 р. 15:07:18 EET baul jianguo wrote: > >>> What is your brick file system? and the glusterfsd process and all > >>> thread status? > >>> I met same issue when client app such as rsync stay in D status,and > >>> the brick process and relate thread also be in the D status. > >>> And the brick dev disk util is 100% . > >>> > >>> On Sun, Jan 17, 2016 at 6:13 AM, Oleksandr Natalenko > >>> > >>> wrote: > >>>> Wrong assumption, rsync hung again. > >>>> > >>>> On субота, 16 січня 2016 р. 22:53:04 EET Oleksandr Natalenko wrote: > >>>>> One possible reason: > >>>>> > >>>>> cluster.lookup-optimize: on > >>>>> cluster.readdir-optimize: on > >>>>> > >>>>> I've disabled both optimizations, and at least as of now rsync still > >>>>> does > >>>>> its job with no issues. I would like to find out what option causes > >>>>> such > >>>>> a > >>>>> behavior and why. Will test more. > >>>>> > >>>>> On пʼятниця, 15 січня 2016 р. 16:09:51 EET Oleksandr Natalenko wrote: > >>>>>> Another observation: if rsyncing is resumed after hang, rsync itself > >>>>>> hangs a lot faster because it does stat of already copied files. So, > >>>>>> the > >>>>>> reason may be not writing itself, but massive stat on GlusterFS > >>>>>> volume > >>>>>> as well. > >>>>>> > >>>>>> 15.01.2016 09:40, Oleksandr Natalenko написав: > >>>>>>> While doing rsync over millions of files from ordinary partition to > >>>>>>> GlusterFS volume, just after approx. first 2 million rsync hang > >>>>>>> happens, and the following info appears in dmesg: > >>>>>>> > >>>>>>> === > >>>>>>> [17075038.924481] INFO: task rsync:10310 blocked for more than 120 > >>>>>>> seconds. > >>>>>>> [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > >>>>>>> disables this message. > >>>>>>> [17075038.940748] rsync D 88207fc13680 0 10310 > >>>>>>> 10309 0x0080 > >>>>>>> [17075038.940752] 8809c578be18 0086 > >>>>>>> 8809c578bfd8 > >>>>>>> 00013680 > >>>>>>> [17075038.940756] 8809c578bfd8 00013680 > >>>>>>> 880310cbe660 > >>>>>>> 881159d16a30 > >>>>>>> [17075038.940759] 881e3aa25800 8809c578be48 > >>>>>>> 881159d16b10 > >>>>>>> 88087d553980 > >>>>>>> [17075038.940762] Call Trace: > >>>>>>> [17075038.940770] [] schedule+0x29/0x70 > >>>>>>> [17075038.940797] [] > >>>>>>> __fuse_request_send+0x13d/0x2c0 > >>>>>>> [fuse] > >>>>>>> [17075038.940801] [] ? > >>>>>>> fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] > >>>>>>> [17075038.940805] [] ? wake_up_bit+0x30/0x30 > >>>>>>> [17075038.9408
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
OK, now I'm re-performing tests with rsync + GlusterFS v3.7.6 + the following patches: === Kaleb S KEITHLEY (1): fuse: use-after-free fix in fuse-bridge, revisited Pranith Kumar K (1): mount/fuse: Fix use-after-free crash Soumya Koduri (3): gfapi: Fix inode nlookup counts inode: Retire the inodes from the lru list in inode_table_destroy upcall: free the xdr* allocations === I run rsync from one GlusterFS volume to another. While memory started from under 100 MiBs, it stalled at around 600 MiBs for source volume and does not grow further. As for target volume it is ~730 MiBs, and that is why I'm going to do several rsync rounds to see if it grows more (with no patches bare 3.7.6 could consume more than 20 GiBs). No "kernel notifier loop terminated" message so far for both volumes. Will report more in several days. I hope current patches will be incorporated into 3.7.7. On пʼятниця, 22 січня 2016 р. 12:53:36 EET Kaleb S. KEITHLEY wrote: > On 01/22/2016 12:43 PM, Oleksandr Natalenko wrote: > > On пʼятниця, 22 січня 2016 р. 12:32:01 EET Kaleb S. KEITHLEY wrote: > >> I presume by this you mean you're not seeing the "kernel notifier loop > >> terminated" error in your logs. > > > > Correct, but only with simple traversing. Have to test under rsync. > > Without the patch I'd get "kernel notifier loop terminated" within a few > minutes of starting I/O. With the patch I haven't seen it in 24 hours > of beating on it. > > >> Hmmm. My system is not leaking. Last 24 hours the RSZ and VSZ are > >> stable: > >> http://download.gluster.org/pub/gluster/glusterfs/dynamic-analysis/longev > >> ity /client.out > > > > What ops do you perform on mounted volume? Read, write, stat? Is that > > 3.7.6 + patches? > > I'm running an internally developed I/O load generator written by a guy > on our perf team. > > it does, create, write, read, rename, stat, delete, and more. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
On пʼятниця, 22 січня 2016 р. 12:32:01 EET Kaleb S. KEITHLEY wrote: > I presume by this you mean you're not seeing the "kernel notifier loop > terminated" error in your logs. Correct, but only with simple traversing. Have to test under rsync. > Hmmm. My system is not leaking. Last 24 hours the RSZ and VSZ are > stable: > http://download.gluster.org/pub/gluster/glusterfs/dynamic-analysis/longevity/client.out What ops do you perform on the mounted volume? Read, write, stat? Is that 3.7.6 + patches? ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
OK, compiles and runs well now, but still leaks. Will try to load the volume with rsync. On четвер, 21 січня 2016 р. 20:40:45 EET Kaleb KEITHLEY wrote: > On 01/21/2016 06:59 PM, Oleksandr Natalenko wrote: > > I see extra GF_FREE (node); added with two patches: > > > > === > > $ git diff HEAD~2 | gist > > https://gist.github.com/9524fa2054cc48278ea8 > > === > > > > Is that intentionally? I guess I face double-free issue. > > I presume you're referring to the release-3.7 branch. > > Yup, bad edit. Long day. That's why we review. ;-) > > Please try the latest. > > Thanks, > > -- > > Kaleb ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
I see extra GF_FREE (node); added with two patches: === $ git diff HEAD~2 | gist https://gist.github.com/9524fa2054cc48278ea8 === Is that intentionally? I guess I face double-free issue. On четвер, 21 січня 2016 р. 17:29:53 EET Kaleb KEITHLEY wrote: > On 01/20/2016 04:08 AM, Oleksandr Natalenko wrote: > > Yes, there are couple of messages like this in my logs too (I guess one > > message per each remount): > > > > === > > [2016-01-18 23:42:08.742447] I [fuse-bridge.c:3875:notify_kernel_loop] 0- > > glusterfs-fuse: kernel notifier loop terminated > > === > > Bug reports and fixes for master and release-3.7 branches are: > > master) > https://bugzilla.redhat.com/show_bug.cgi?id=1288857 > http://review.gluster.org/12886 > > release-3.7) > https://bugzilla.redhat.com/show_bug.cgi?id=1288922 > http://review.gluster.org/12887 > > The release-3.7 fix will be in glusterfs-3.7.7 when it's released. > > I think with even with the above fixes applied there are still some > issues remaining. I have submitted additional/revised fixes on top of > the above fixes at: > > master: http://review.gluster.org/13274 > release-3.7: http://review.gluster.org/13275 > > I invite you to review the patches in gerrit (review.gluster.org). > > Regards, > > -- > > Kaleb ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
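For anyone reproducing this, a pending Gerrit change can be tested locally by fetching it on top of the release-3.7 branch, roughly like this (change and patchset numbers below are illustrative; check review.gluster.org for the current revision):

===
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git checkout release-3.7
# refs/changes/<last two digits of change>/<change number>/<patchset>
git fetch https://review.gluster.org/glusterfs refs/changes/75/13275/1
git cherry-pick FETCH_HEAD
===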
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
With the proposed patches I get the following assertion while copying files to GlusterFS volume: === glusterfs: mem-pool.c:305: __gf_free: Assertion `0xCAFEBABE == header->magic' failed. Program received signal SIGABRT, Aborted. [Switching to Thread 0x7fffe9ffb700 (LWP 12635)] 0x76f215f8 in raise () from /usr/lib/libc.so.6 (gdb) bt #0 0x76f215f8 in raise () from /usr/lib/libc.so.6 #1 0x76f22a7a in abort () from /usr/lib/libc.so.6 #2 0x76f1a417 in __assert_fail_base () from /usr/lib/libc.so.6 #3 0x76f1a4c2 in __assert_fail () from /usr/lib/libc.so.6 #4 0x77b6046b in __gf_free (free_ptr=0x7fffdc0b8f00) at mem-pool.c: 305 #5 0x75144eb9 in notify_kernel_loop (data=0x63df90) at fuse-bridge.c: 3893 #6 0x772994a4 in start_thread () from /usr/lib/libpthread.so.0 #7 0x76fd713d in clone () from /usr/lib/libc.so.6 === On четвер, 21 січня 2016 р. 17:29:53 EET Kaleb KEITHLEY wrote: > On 01/20/2016 04:08 AM, Oleksandr Natalenko wrote: > > Yes, there are couple of messages like this in my logs too (I guess one > > message per each remount): > > > > === > > [2016-01-18 23:42:08.742447] I [fuse-bridge.c:3875:notify_kernel_loop] 0- > > glusterfs-fuse: kernel notifier loop terminated > > === > > Bug reports and fixes for master and release-3.7 branches are: > > master) > https://bugzilla.redhat.com/show_bug.cgi?id=1288857 > http://review.gluster.org/12886 > > release-3.7) > https://bugzilla.redhat.com/show_bug.cgi?id=1288922 > http://review.gluster.org/12887 > > The release-3.7 fix will be in glusterfs-3.7.7 when it's released. > > I think with even with the above fixes applied there are still some > issues remaining. I have submitted additional/revised fixes on top of > the above fixes at: > > master: http://review.gluster.org/13274 > release-3.7: http://review.gluster.org/13275 > > I invite you to review the patches in gerrit (review.gluster.org). > > Regards, > > -- > > Kaleb ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
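The same assertion can be inspected a bit closer by letting gdb catch the abort in the foreground mount (a sketch; needs a build with debug symbols so the locals in __gf_free are visible, and <server>/<volume> are placeholders):

===
gdb --args /usr/bin/glusterfs -N --volfile-server=<server> --volfile-id=<volume> /mnt/<volume>
(gdb) run
# ... copy files until SIGABRT is raised, then:
(gdb) bt
(gdb) frame 4          # __gf_free, as in the backtrace above
(gdb) print *header    # magic/type of the allocation being freed
===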
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
I perform the tests using 1) rsync (massive copy of millions of files); 2) find (simple tree traversing). To check if memory leak happens, I use find tool. I've performed two traversing (w/ and w/o fopen-keep-cache=off) with remount between them, but I didn't encounter "kernel notifier loop terminated" message during both traversing as well as before unmounting volume. Nevertheless, memory still leaks (at least up to 3 GiB in each case), so I believe invalidation requests are not the case. I've also checked logs for the volume where I do rsync, and the message "kernel notifier loop terminated" happens somewhere in the middle of rsyncing, not before unmounting. But the memory starts leaking on rsync start as well, not just after "kernel notifier loop terminated" message. So, I believe, "kernel notifier loop terminated" is not the case again. Also, I've tried to implement quick and dirty GlusterFS FUSE client using API (see https://github.com/pfactum/xglfs), and with latest patches from this thread (http://review.gluster.org/#/c/13096/, http://review.gluster.org/#/c/ 13125/ and http://review.gluster.org/#/c/13232/) my FUSE client does not leak on tree traversing. So, I believe, this should be related to GlusterFS FUSE implementation. How could I debug memory leak better? On четвер, 21 січня 2016 р. 10:32:32 EET Xavier Hernandez wrote: > If this message appears way before the volume is unmounted, can you try > to start the volume manually using this command and repeat the tests ? > > glusterfs --fopen-keep-cache=off --volfile-server= > --volfile-id=/ > > This will prevent invalidation requests to be sent to the kernel, so > there shouldn't be any memory leak even if the worker thread exits > prematurely. > > If that solves the problem, we could try to determine the cause of the > premature exit and solve it. > > Xavi > > On 20/01/16 10:08, Oleksandr Natalenko wrote: > > Yes, there are couple of messages like this in my logs too (I guess one > > message per each remount): > > > > === > > [2016-01-18 23:42:08.742447] I [fuse-bridge.c:3875:notify_kernel_loop] 0- > > glusterfs-fuse: kernel notifier loop terminated > > === > > > > On середа, 20 січня 2016 р. 09:51:23 EET Xavier Hernandez wrote: > >> I'm seeing a similar problem with 3.7.6. > >> > >> This latest statedump contains a lot of gf_fuse_mt_invalidate_node_t > >> objects in fuse. Looking at the code I see they are used to send > >> invalidations to kernel fuse, however this is done in a separate thread > >> that writes a log message when it exits. On the system I'm seeing the > >> memory leak, I can see that message in the log files: > >> > >> [2016-01-18 23:04:55.384873] I [fuse-bridge.c:3875:notify_kernel_loop] > >> 0-glusterfs-fuse: kernel notifier loop terminated > >> > >> But the volume is still working at this moment, so any future inode > >> invalidations will leak memory because it was this thread that should > >> release it. > >> > >> Can you check if you also see this message in the mount log ? > >> > >> It seems that this thread terminates if write returns any error > >> different than ENOENT. I'm not sure if there could be any other error > >> that can cause this. > >> > >> Xavi > >> > >> On 20/01/16 00:13, Oleksandr Natalenko wrote: > >>> Here is another RAM usage stats and statedump of GlusterFS mount > >>> approaching to just another OOM: > >>> > >>> === > >>> root 32495 1.4 88.3 4943868 1697316 ? 
Ssl Jan13 129:18 > >>> /usr/sbin/ > >>> glusterfs --volfile-server=server.example.com --volfile-id=volume > >>> /mnt/volume === > >>> > >>> https://gist.github.com/86198201c79e927b46bd > >>> > >>> 1.6G of RAM just for almost idle mount (we occasionally store Asterisk > >>> recordings there). Triple OOM for 69 days of uptime. > >>> > >>> Any thoughts? > >>> > >>> On середа, 13 січня 2016 р. 16:26:59 EET Soumya Koduri wrote: > >>>> kill -USR1 > >>> > >>> ___ > >>> Gluster-devel mailing list > >>> Gluster-devel@gluster.org > >>> http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
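On the "how could I debug memory leak better" question: besides --leak-check, valgrind's massif tool shows where live heap memory sits while the process is still running, which maps more directly onto RSS growth during traversal (a sketch; same foreground invocation as the other valgrind runs, with placeholders for server and volume):

===
valgrind --tool=massif --massif-out-file=massif.fuse.out \
    /usr/bin/glusterfs -N --volfile-server=<server> --volfile-id=<volume> /mnt/<volume>
find /mnt/<volume> -type d > /dev/null
umount /mnt/<volume>
ms_print massif.fuse.out | less
===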
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Yes, there are couple of messages like this in my logs too (I guess one message per each remount): === [2016-01-18 23:42:08.742447] I [fuse-bridge.c:3875:notify_kernel_loop] 0- glusterfs-fuse: kernel notifier loop terminated === On середа, 20 січня 2016 р. 09:51:23 EET Xavier Hernandez wrote: > I'm seeing a similar problem with 3.7.6. > > This latest statedump contains a lot of gf_fuse_mt_invalidate_node_t > objects in fuse. Looking at the code I see they are used to send > invalidations to kernel fuse, however this is done in a separate thread > that writes a log message when it exits. On the system I'm seeing the > memory leak, I can see that message in the log files: > > [2016-01-18 23:04:55.384873] I [fuse-bridge.c:3875:notify_kernel_loop] > 0-glusterfs-fuse: kernel notifier loop terminated > > But the volume is still working at this moment, so any future inode > invalidations will leak memory because it was this thread that should > release it. > > Can you check if you also see this message in the mount log ? > > It seems that this thread terminates if write returns any error > different than ENOENT. I'm not sure if there could be any other error > that can cause this. > > Xavi > > On 20/01/16 00:13, Oleksandr Natalenko wrote: > > Here is another RAM usage stats and statedump of GlusterFS mount > > approaching to just another OOM: > > > > === > > root 32495 1.4 88.3 4943868 1697316 ? Ssl Jan13 129:18 > > /usr/sbin/ > > glusterfs --volfile-server=server.example.com --volfile-id=volume > > /mnt/volume === > > > > https://gist.github.com/86198201c79e927b46bd > > > > 1.6G of RAM just for almost idle mount (we occasionally store Asterisk > > recordings there). Triple OOM for 69 days of uptime. > > > > Any thoughts? > > > > On середа, 13 січня 2016 р. 16:26:59 EET Soumya Koduri wrote: > >> kill -USR1 > > > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
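Checking for the premature notifier-thread exit across all client logs is a one-liner (assuming the default client log location):

===
grep -H "kernel notifier loop terminated" /var/log/glusterfs/*.log
===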
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
And another statedump of FUSE mount client consuming more than 7 GiB of RAM: https://gist.github.com/136d7c49193c798b3ade DHT-related leak? On середа, 13 січня 2016 р. 16:26:59 EET Soumya Koduri wrote: > On 01/13/2016 04:08 PM, Soumya Koduri wrote: > > On 01/12/2016 12:46 PM, Oleksandr Natalenko wrote: > >> Just in case, here is Valgrind output on FUSE client with 3.7.6 + > >> API-related patches we discussed before: > >> > >> https://gist.github.com/cd6605ca19734c1496a4 > > > > Thanks for sharing the results. I made changes to fix one leak reported > > there wrt ' client_cbk_cache_invalidation' - > > > > - http://review.gluster.org/#/c/13232/ > > > > The other inode* related memory reported as lost is mainly (maybe) > > because fuse client process doesn't cleanup its memory (doesn't use > > fini()) while exiting the process. Hence majority of those allocations > > are listed as lost. But most of the inodes should have got purged when > > we drop vfs cache. Did you do drop vfs cache before exiting the process? > > > > I shall add some log statements and check that part > > Also please take statedump of the fuse mount process (after dropping vfs > cache) when you see high memory usage by issuing the following command - > 'kill -USR1 ' > > The statedump will be copied to 'glusterdump..dump.tim > estamp` file in /var/run/gluster or /usr/local/var/run/gluster. > Please refer to [1] for more information. > > Thanks, > Soumya > [1] http://review.gluster.org/#/c/8288/1/doc/debugging/statedump.md > > > Thanks, > > Soumya > > > >> 12.01.2016 08:24, Soumya Koduri написав: > >>> For fuse client, I tried vfs drop_caches as suggested by Vijay in an > >>> earlier mail. Though all the inodes get purged, I still doesn't see > >>> much difference in the memory footprint drop. Need to investigate what > >>> else is consuming so much memory here. > > > > ___ > > Gluster-users mailing list > > gluster-us...@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
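A statedump that large can be summarized by ranking the per-type memory usage sections, which also shows whether DHT or fuse allocation types dominate (a sketch; assumes the usual "usage-type ... memusage" / "size=" layout of statedumps):

===
awk '/usage-type/ { type = $0 }
     /^size=/     { sub("size=", "", $0); print $0, type }' \
    glusterdump.<pid>.dump.<timestamp> | sort -n | tail -20
===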
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Here is another RAM usage stats and statedump of GlusterFS mount approaching to just another OOM: === root 32495 1.4 88.3 4943868 1697316 ? Ssl Jan13 129:18 /usr/sbin/ glusterfs --volfile-server=server.example.com --volfile-id=volume /mnt/volume === https://gist.github.com/86198201c79e927b46bd 1.6G of RAM just for almost idle mount (we occasionally store Asterisk recordings there). Triple OOM for 69 days of uptime. Any thoughts? On середа, 13 січня 2016 р. 16:26:59 EET Soumya Koduri wrote: > kill -USR1 ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
XFS. Server side works OK, I'm able to mount volume again. Brick is 30% full. On понеділок, 18 січня 2016 р. 15:07:18 EET baul jianguo wrote: > What is your brick file system? and the glusterfsd process and all > thread status? > I met same issue when client app such as rsync stay in D status,and > the brick process and relate thread also be in the D status. > And the brick dev disk util is 100% . > > On Sun, Jan 17, 2016 at 6:13 AM, Oleksandr Natalenko > > wrote: > > Wrong assumption, rsync hung again. > > > > On субота, 16 січня 2016 р. 22:53:04 EET Oleksandr Natalenko wrote: > >> One possible reason: > >> > >> cluster.lookup-optimize: on > >> cluster.readdir-optimize: on > >> > >> I've disabled both optimizations, and at least as of now rsync still does > >> its job with no issues. I would like to find out what option causes such > >> a > >> behavior and why. Will test more. > >> > >> On пʼятниця, 15 січня 2016 р. 16:09:51 EET Oleksandr Natalenko wrote: > >> > Another observation: if rsyncing is resumed after hang, rsync itself > >> > hangs a lot faster because it does stat of already copied files. So, > >> > the > >> > reason may be not writing itself, but massive stat on GlusterFS volume > >> > as well. > >> > > >> > 15.01.2016 09:40, Oleksandr Natalenko написав: > >> > > While doing rsync over millions of files from ordinary partition to > >> > > GlusterFS volume, just after approx. first 2 million rsync hang > >> > > happens, and the following info appears in dmesg: > >> > > > >> > > === > >> > > [17075038.924481] INFO: task rsync:10310 blocked for more than 120 > >> > > seconds. > >> > > [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > >> > > disables this message. > >> > > [17075038.940748] rsync D 88207fc13680 0 10310 > >> > > 10309 0x0080 > >> > > [17075038.940752] 8809c578be18 0086 8809c578bfd8 > >> > > 00013680 > >> > > [17075038.940756] 8809c578bfd8 00013680 880310cbe660 > >> > > 881159d16a30 > >> > > [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 > >> > > 88087d553980 > >> > > [17075038.940762] Call Trace: > >> > > [17075038.940770] [] schedule+0x29/0x70 > >> > > [17075038.940797] [] > >> > > __fuse_request_send+0x13d/0x2c0 > >> > > [fuse] > >> > > [17075038.940801] [] ? > >> > > fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] > >> > > [17075038.940805] [] ? wake_up_bit+0x30/0x30 > >> > > [17075038.940809] [] fuse_request_send+0x12/0x20 > >> > > [fuse] > >> > > [17075038.940813] [] fuse_flush+0xff/0x150 [fuse] > >> > > [17075038.940817] [] filp_close+0x34/0x80 > >> > > [17075038.940821] [] __close_fd+0x78/0xa0 > >> > > [17075038.940824] [] SyS_close+0x23/0x50 > >> > > [17075038.940828] [] > >> > > system_call_fastpath+0x16/0x1b > >> > > === > >> > > > >> > > rsync blocks in D state, and to kill it, I have to do umount --lazy > >> > > on > >> > > GlusterFS mountpoint, and then kill corresponding client glusterfs > >> > > process. Then rsync exits. 
> >> > > > >> > > Here is GlusterFS volume info: > >> > > > >> > > === > >> > > Volume Name: asterisk_records > >> > > Type: Distributed-Replicate > >> > > Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4 > >> > > Status: Started > >> > > Number of Bricks: 3 x 2 = 6 > >> > > Transport-type: tcp > >> > > Bricks: > >> > > Brick1: > >> > > server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01 > >> > > /as > >> > > te > >> > > risk/records Brick2: > >> > > server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_ > >> > > 01/ > >> > > as > >> > > terisk/records Brick3: > >> > > server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02 > >> > > /as > >> > > te > >> > > risk/records Brick4: > >>
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
Wrong assumption, rsync hung again. On субота, 16 січня 2016 р. 22:53:04 EET Oleksandr Natalenko wrote: > One possible reason: > > cluster.lookup-optimize: on > cluster.readdir-optimize: on > > I've disabled both optimizations, and at least as of now rsync still does > its job with no issues. I would like to find out what option causes such a > behavior and why. Will test more. > > On пʼятниця, 15 січня 2016 р. 16:09:51 EET Oleksandr Natalenko wrote: > > Another observation: if rsyncing is resumed after hang, rsync itself > > hangs a lot faster because it does stat of already copied files. So, the > > reason may be not writing itself, but massive stat on GlusterFS volume > > as well. > > > > 15.01.2016 09:40, Oleksandr Natalenko написав: > > > While doing rsync over millions of files from ordinary partition to > > > GlusterFS volume, just after approx. first 2 million rsync hang > > > happens, and the following info appears in dmesg: > > > > > > === > > > [17075038.924481] INFO: task rsync:10310 blocked for more than 120 > > > seconds. > > > [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > > disables this message. > > > [17075038.940748] rsync D 88207fc13680 0 10310 > > > 10309 0x0080 > > > [17075038.940752] 8809c578be18 0086 8809c578bfd8 > > > 00013680 > > > [17075038.940756] 8809c578bfd8 00013680 880310cbe660 > > > 881159d16a30 > > > [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 > > > 88087d553980 > > > [17075038.940762] Call Trace: > > > [17075038.940770] [] schedule+0x29/0x70 > > > [17075038.940797] [] __fuse_request_send+0x13d/0x2c0 > > > [fuse] > > > [17075038.940801] [] ? > > > fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] > > > [17075038.940805] [] ? wake_up_bit+0x30/0x30 > > > [17075038.940809] [] fuse_request_send+0x12/0x20 > > > [fuse] > > > [17075038.940813] [] fuse_flush+0xff/0x150 [fuse] > > > [17075038.940817] [] filp_close+0x34/0x80 > > > [17075038.940821] [] __close_fd+0x78/0xa0 > > > [17075038.940824] [] SyS_close+0x23/0x50 > > > [17075038.940828] [] system_call_fastpath+0x16/0x1b > > > === > > > > > > rsync blocks in D state, and to kill it, I have to do umount --lazy on > > > GlusterFS mountpoint, and then kill corresponding client glusterfs > > > process. Then rsync exits. 
> > > > > > Here is GlusterFS volume info: > > > > > > === > > > Volume Name: asterisk_records > > > Type: Distributed-Replicate > > > Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4 > > > Status: Started > > > Number of Bricks: 3 x 2 = 6 > > > Transport-type: tcp > > > Bricks: > > > Brick1: > > > server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/as > > > te > > > risk/records Brick2: > > > server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/ > > > as > > > terisk/records Brick3: > > > server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/as > > > te > > > risk/records Brick4: > > > server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/ > > > as > > > terisk/records Brick5: > > > server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/ > > > as > > > terisk/records Brick6: > > > server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03 > > > /a > > > sterisk/records Options Reconfigured: > > > cluster.lookup-optimize: on > > > cluster.readdir-optimize: on > > > client.event-threads: 2 > > > network.inode-lru-limit: 4096 > > > server.event-threads: 4 > > > performance.client-io-threads: on > > > storage.linux-aio: on > > > performance.write-behind-window-size: 4194304 > > > performance.stat-prefetch: on > > > performance.quick-read: on > > > performance.read-ahead: on > > > performance.flush-behind: on > > > performance.write-behind: on > > > performance.io-thread-count: 2 > > > performance.cache-max-file-size: 1048576 > > > performance.cache-size: 33554432 > > > features.cache-invalidation: on > > > performance.readdir-ahead: on > > > === > > > > > > The issue reproduces each time I rsync such an amount of files. > > > > > > How could I debug this issue better? > > > ___ > > > Gluster-users mailing list > > > gluster-us...@gluster.org > > > http://www.gluster.org/mailman/listinfo/gluster-users > > > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
One possible reason: cluster.lookup-optimize: on cluster.readdir-optimize: on I've disabled both optimizations, and at least as of now rsync still does its job with no issues. I would like to find out what option causes such a behavior and why. Will test more. On пʼятниця, 15 січня 2016 р. 16:09:51 EET Oleksandr Natalenko wrote: > Another observation: if rsyncing is resumed after hang, rsync itself > hangs a lot faster because it does stat of already copied files. So, the > reason may be not writing itself, but massive stat on GlusterFS volume > as well. > > 15.01.2016 09:40, Oleksandr Natalenko написав: > > While doing rsync over millions of files from ordinary partition to > > GlusterFS volume, just after approx. first 2 million rsync hang > > happens, and the following info appears in dmesg: > > > > === > > [17075038.924481] INFO: task rsync:10310 blocked for more than 120 > > seconds. > > [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [17075038.940748] rsync D 88207fc13680 0 10310 > > 10309 0x0080 > > [17075038.940752] 8809c578be18 0086 8809c578bfd8 > > 00013680 > > [17075038.940756] 8809c578bfd8 00013680 880310cbe660 > > 881159d16a30 > > [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 > > 88087d553980 > > [17075038.940762] Call Trace: > > [17075038.940770] [] schedule+0x29/0x70 > > [17075038.940797] [] __fuse_request_send+0x13d/0x2c0 > > [fuse] > > [17075038.940801] [] ? > > fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] > > [17075038.940805] [] ? wake_up_bit+0x30/0x30 > > [17075038.940809] [] fuse_request_send+0x12/0x20 > > [fuse] > > [17075038.940813] [] fuse_flush+0xff/0x150 [fuse] > > [17075038.940817] [] filp_close+0x34/0x80 > > [17075038.940821] [] __close_fd+0x78/0xa0 > > [17075038.940824] [] SyS_close+0x23/0x50 > > [17075038.940828] [] system_call_fastpath+0x16/0x1b > > === > > > > rsync blocks in D state, and to kill it, I have to do umount --lazy on > > GlusterFS mountpoint, and then kill corresponding client glusterfs > > process. Then rsync exits. 
> > > > Here is GlusterFS volume info: > > > > === > > Volume Name: asterisk_records > > Type: Distributed-Replicate > > Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4 > > Status: Started > > Number of Bricks: 3 x 2 = 6 > > Transport-type: tcp > > Bricks: > > Brick1: > > server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/aste > > risk/records Brick2: > > server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/as > > terisk/records Brick3: > > server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/aste > > risk/records Brick4: > > server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/as > > terisk/records Brick5: > > server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/as > > terisk/records Brick6: > > server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/a > > sterisk/records Options Reconfigured: > > cluster.lookup-optimize: on > > cluster.readdir-optimize: on > > client.event-threads: 2 > > network.inode-lru-limit: 4096 > > server.event-threads: 4 > > performance.client-io-threads: on > > storage.linux-aio: on > > performance.write-behind-window-size: 4194304 > > performance.stat-prefetch: on > > performance.quick-read: on > > performance.read-ahead: on > > performance.flush-behind: on > > performance.write-behind: on > > performance.io-thread-count: 2 > > performance.cache-max-file-size: 1048576 > > performance.cache-size: 33554432 > > features.cache-invalidation: on > > performance.readdir-ahead: on > > === > > > > The issue reproduces each time I rsync such an amount of files. > > > > How could I debug this issue better? > > ___ > > Gluster-users mailing list > > gluster-us...@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-users > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
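Since both options were flipped at once, bisecting them one per rsync round would narrow down which one actually matters (volume name taken from the quoted volume info; these are the usual option toggles):

===
gluster volume set asterisk_records cluster.lookup-optimize off
# run an rsync round; if it still hangs, restore it and try the other one
gluster volume set asterisk_records cluster.lookup-optimize on
gluster volume set asterisk_records cluster.readdir-optimize off
===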
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
Another observation: if rsyncing is resumed after hang, rsync itself hangs a lot faster because it does stat of already copied files. So, the reason may be not writing itself, but massive stat on GlusterFS volume as well. 15.01.2016 09:40, Oleksandr Natalenko написав: While doing rsync over millions of files from ordinary partition to GlusterFS volume, just after approx. first 2 million rsync hang happens, and the following info appears in dmesg: === [17075038.924481] INFO: task rsync:10310 blocked for more than 120 seconds. [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [17075038.940748] rsync D 88207fc13680 0 10310 10309 0x0080 [17075038.940752] 8809c578be18 0086 8809c578bfd8 00013680 [17075038.940756] 8809c578bfd8 00013680 880310cbe660 881159d16a30 [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 88087d553980 [17075038.940762] Call Trace: [17075038.940770] [] schedule+0x29/0x70 [17075038.940797] [] __fuse_request_send+0x13d/0x2c0 [fuse] [17075038.940801] [] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] [17075038.940805] [] ? wake_up_bit+0x30/0x30 [17075038.940809] [] fuse_request_send+0x12/0x20 [fuse] [17075038.940813] [] fuse_flush+0xff/0x150 [fuse] [17075038.940817] [] filp_close+0x34/0x80 [17075038.940821] [] __close_fd+0x78/0xa0 [17075038.940824] [] SyS_close+0x23/0x50 [17075038.940828] [] system_call_fastpath+0x16/0x1b === rsync blocks in D state, and to kill it, I have to do umount --lazy on GlusterFS mountpoint, and then kill corresponding client glusterfs process. Then rsync exits. Here is GlusterFS volume info: === Volume Name: asterisk_records Type: Distributed-Replicate Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4 Status: Started Number of Bricks: 3 x 2 = 6 Transport-type: tcp Bricks: Brick1: server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/asterisk/records Brick2: server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/asterisk/records Brick3: server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/asterisk/records Brick4: server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/asterisk/records Brick5: server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/asterisk/records Brick6: server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/asterisk/records Options Reconfigured: cluster.lookup-optimize: on cluster.readdir-optimize: on client.event-threads: 2 network.inode-lru-limit: 4096 server.event-threads: 4 performance.client-io-threads: on storage.linux-aio: on performance.write-behind-window-size: 4194304 performance.stat-prefetch: on performance.quick-read: on performance.read-ahead: on performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 2 performance.cache-max-file-size: 1048576 performance.cache-size: 33554432 features.cache-invalidation: on performance.readdir-ahead: on === The issue reproduces each time I rsync such an amount of files. How could I debug this issue better? ___ Gluster-users mailing list gluster-us...@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
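If massive stat is the suspect, a metadata-only pass over the already-copied part of the tree should reproduce the hang without writing anything (a sketch; paths are placeholders):

===
# pure stat load over the copied subtree
find /mnt/<volume>/<already-copied-subdir> -type f -print0 | xargs -0 stat > /dev/null

# or let rsync do only its comparison phase
rsync -a --dry-run /source/dir/ /mnt/<volume>/<already-copied-subdir>/
===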
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
Here is similar issue described on serverfault.com: https://serverfault.com/questions/716410/rsync-crashes-machine-while-performing-sync-on-glusterfs-mounted-share I've checked GlusterFS logs with no luck — as if nothing happened. P.S. GlusterFS v3.7.6. 15.01.2016 09:40, Oleksandr Natalenko написав: While doing rsync over millions of files from ordinary partition to GlusterFS volume, just after approx. first 2 million rsync hang happens, and the following info appears in dmesg: === [17075038.924481] INFO: task rsync:10310 blocked for more than 120 seconds. [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [17075038.940748] rsync D 88207fc13680 0 10310 10309 0x0080 [17075038.940752] 8809c578be18 0086 8809c578bfd8 00013680 [17075038.940756] 8809c578bfd8 00013680 880310cbe660 881159d16a30 [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 88087d553980 [17075038.940762] Call Trace: [17075038.940770] [] schedule+0x29/0x70 [17075038.940797] [] __fuse_request_send+0x13d/0x2c0 [fuse] [17075038.940801] [] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] [17075038.940805] [] ? wake_up_bit+0x30/0x30 [17075038.940809] [] fuse_request_send+0x12/0x20 [fuse] [17075038.940813] [] fuse_flush+0xff/0x150 [fuse] [17075038.940817] [] filp_close+0x34/0x80 [17075038.940821] [] __close_fd+0x78/0xa0 [17075038.940824] [] SyS_close+0x23/0x50 [17075038.940828] [] system_call_fastpath+0x16/0x1b === rsync blocks in D state, and to kill it, I have to do umount --lazy on GlusterFS mountpoint, and then kill corresponding client glusterfs process. Then rsync exits. Here is GlusterFS volume info: === Volume Name: asterisk_records Type: Distributed-Replicate Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4 Status: Started Number of Bricks: 3 x 2 = 6 Transport-type: tcp Bricks: Brick1: server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/asterisk/records Brick2: server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/asterisk/records Brick3: server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/asterisk/records Brick4: server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/asterisk/records Brick5: server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/asterisk/records Brick6: server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/asterisk/records Options Reconfigured: cluster.lookup-optimize: on cluster.readdir-optimize: on client.event-threads: 2 network.inode-lru-limit: 4096 server.event-threads: 4 performance.client-io-threads: on storage.linux-aio: on performance.write-behind-window-size: 4194304 performance.stat-prefetch: on performance.quick-read: on performance.read-ahead: on performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 2 performance.cache-max-file-size: 1048576 performance.cache-size: 33554432 features.cache-invalidation: on performance.readdir-ahead: on === The issue reproduces each time I rsync such an amount of files. How could I debug this issue better? ___ Gluster-users mailing list gluster-us...@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] GlusterFS FUSE client hangs on rsyncing lots of files
While doing rsync over millions of files from ordinary partition to GlusterFS volume, just after approx. first 2 million rsync hang happens, and the following info appears in dmesg: === [17075038.924481] INFO: task rsync:10310 blocked for more than 120 seconds. [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [17075038.940748] rsync D 88207fc13680 0 10310 10309 0x0080 [17075038.940752] 8809c578be18 0086 8809c578bfd8 00013680 [17075038.940756] 8809c578bfd8 00013680 880310cbe660 881159d16a30 [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 88087d553980 [17075038.940762] Call Trace: [17075038.940770] [] schedule+0x29/0x70 [17075038.940797] [] __fuse_request_send+0x13d/0x2c0 [fuse] [17075038.940801] [] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse] [17075038.940805] [] ? wake_up_bit+0x30/0x30 [17075038.940809] [] fuse_request_send+0x12/0x20 [fuse] [17075038.940813] [] fuse_flush+0xff/0x150 [fuse] [17075038.940817] [] filp_close+0x34/0x80 [17075038.940821] [] __close_fd+0x78/0xa0 [17075038.940824] [] SyS_close+0x23/0x50 [17075038.940828] [] system_call_fastpath+0x16/0x1b === rsync blocks in D state, and to kill it, I have to do umount --lazy on GlusterFS mountpoint, and then kill corresponding client glusterfs process. Then rsync exits. Here is GlusterFS volume info: === Volume Name: asterisk_records Type: Distributed-Replicate Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4 Status: Started Number of Bricks: 3 x 2 = 6 Transport-type: tcp Bricks: Brick1: server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/asterisk/records Brick2: server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/asterisk/records Brick3: server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/asterisk/records Brick4: server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/asterisk/records Brick5: server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/asterisk/records Brick6: server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/asterisk/records Options Reconfigured: cluster.lookup-optimize: on cluster.readdir-optimize: on client.event-threads: 2 network.inode-lru-limit: 4096 server.event-threads: 4 performance.client-io-threads: on storage.linux-aio: on performance.write-behind-window-size: 4194304 performance.stat-prefetch: on performance.quick-read: on performance.read-ahead: on performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 2 performance.cache-max-file-size: 1048576 performance.cache-size: 33554432 features.cache-invalidation: on performance.readdir-ahead: on === The issue reproduces each time I rsync such an amount of files. How could I debug this issue better? ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
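When rsync is stuck in D state, a few things are worth collecting before the lazy umount, since they show where both the kernel and the client are blocked (a sketch; needs root, and the pid lookups assume a single rsync and a single glusterfs client on the host):

===
cat /proc/$(pgrep -x rsync)/stack                      # kernel-side stack of the stuck task
kill -USR1 $(pgrep -f 'glusterfs --volfile-server')    # FUSE client statedump
gluster volume statedump asterisk_records              # brick statedumps
dmesg | tail -n 50
===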
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
I've applied client_cbk_cache_invalidation leak patch, and here are the results. Launch: === valgrind --leak-check=full --show-leak-kinds=all --log-file="valgrind_fuse.log" /usr/bin/glusterfs -N --volfile-server=server.example.com --volfile-id=somevolume /mnt/somevolume find /mnt/somevolume -type d === During the traversing, memory RSS value for glusterfs process went from 79M to 644M. Then I performed dropping VFS cache (as I did in previous tests), but RSS value was not affected. Then I did statedump: https://gist.github.com/11c7b11fc99ab123e6e2 Then I unmounted the volume and got Valgrind log: https://gist.github.com/99d2e3c5cb4ed50b091c Leaks reported by Valgrind do not conform by their size to overall runtime memory consumption, so I believe with the latest patch some cleanup is being performed better on exit (unmount), but in runtime there are still some issues. 13.01.2016 12:56, Soumya Koduri написав: On 01/13/2016 04:08 PM, Soumya Koduri wrote: On 01/12/2016 12:46 PM, Oleksandr Natalenko wrote: Just in case, here is Valgrind output on FUSE client with 3.7.6 + API-related patches we discussed before: https://gist.github.com/cd6605ca19734c1496a4 Thanks for sharing the results. I made changes to fix one leak reported there wrt ' client_cbk_cache_invalidation' - - http://review.gluster.org/#/c/13232/ The other inode* related memory reported as lost is mainly (maybe) because fuse client process doesn't cleanup its memory (doesn't use fini()) while exiting the process. Hence majority of those allocations are listed as lost. But most of the inodes should have got purged when we drop vfs cache. Did you do drop vfs cache before exiting the process? I shall add some log statements and check that part Also please take statedump of the fuse mount process (after dropping vfs cache) when you see high memory usage by issuing the following command - 'kill -USR1 ' The statedump will be copied to 'glusterdump..dump.tim estamp` file in /var/run/gluster or /usr/local/var/run/gluster. Please refer to [1] for more information. Thanks, Soumya [1] http://review.gluster.org/#/c/8288/1/doc/debugging/statedump.md Thanks, Soumya 12.01.2016 08:24, Soumya Koduri написав: For fuse client, I tried vfs drop_caches as suggested by Vijay in an earlier mail. Though all the inodes get purged, I still doesn't see much difference in the memory footprint drop. Need to investigate what else is consuming so much memory here. ___ Gluster-users mailing list gluster-us...@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Just in case, here is Valgrind output on FUSE client with 3.7.6 + API-related patches we discussed before: https://gist.github.com/cd6605ca19734c1496a4 12.01.2016 08:24, Soumya Koduri написав: For fuse client, I tried vfs drop_caches as suggested by Vijay in an earlier mail. Though all the inodes get purged, I still doesn't see much difference in the memory footprint drop. Need to investigate what else is consuming so much memory here. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Brief test shows that Ganesha stopped leaking and crashing, so it seems to be good for me. Nevertheless, back to my original question: what about FUSE client? It is still leaking despite all the fixes applied. Should it be considered another issue? 11.01.2016 12:26, Soumya Koduri написав: I have made changes to fix the lookup leak in a different way (as discussed with Pranith) and uploaded them in the latest patch set #4 - http://review.gluster.org/#/c/13096/ Please check if it resolves the mem leak and hopefully doesn't result in any assertion :) Thanks, Soumya On 01/08/2016 05:04 PM, Soumya Koduri wrote: I could reproduce while testing deep directories with in the mount point. I root caus'ed the issue & had discussion with Pranith to understand the purpose and recommended way of taking nlookup on inodes. I shall make changes to my existing fix and post the patch soon. Thanks for your patience! -Soumya On 01/07/2016 07:34 PM, Oleksandr Natalenko wrote: OK, I've patched GlusterFS v3.7.6 with 43570a01 and 5cffb56b (the most recent revisions) and NFS-Ganesha v2.3.0 with 8685abfc (most recent revision too). On traversing GlusterFS volume with many files in one folder via NFS mount I get an assertion: === ganesha.nfsd: inode.c:716: __inode_forget: Assertion `inode->nlookup >= nlookup' failed. === I used GDB on NFS-Ganesha process to get appropriate stacktraces: 1. short stacktrace of failed thread: https://gist.github.com/7f63bb99c530d26ded18 2. full stacktrace of failed thread: https://gist.github.com/d9bc7bc8f6a0bbff9e86 3. short stacktrace of all threads: https://gist.github.com/f31da7725306854c719f 4. full stacktrace of all threads: https://gist.github.com/65cbc562b01211ea5612 GlusterFS volume configuration: https://gist.github.com/30f0129d16e25d4a5a52 ganesha.conf: https://gist.github.com/9b5e59b8d6d8cb84c85d How I mount NFS share: === mount -t nfs4 127.0.0.1:/mail_boxes /mnt/tmp -o defaults,_netdev,minorversion=2,noac,noacl,lookupcache=none,timeo=100 === On четвер, 7 січня 2016 р. 12:06:42 EET Soumya Koduri wrote: Entries_HWMark = 500; ___ Gluster-users mailing list gluster-us...@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
OK, I've patched GlusterFS v3.7.6 with 43570a01 and 5cffb56b (the most recent revisions) and NFS-Ganesha v2.3.0 with 8685abfc (most recent revision too). On traversing GlusterFS volume with many files in one folder via NFS mount I get an assertion: === ganesha.nfsd: inode.c:716: __inode_forget: Assertion `inode->nlookup >= nlookup' failed. === I used GDB on NFS-Ganesha process to get appropriate stacktraces: 1. short stacktrace of failed thread: https://gist.github.com/7f63bb99c530d26ded18 2. full stacktrace of failed thread: https://gist.github.com/d9bc7bc8f6a0bbff9e86 3. short stacktrace of all threads: https://gist.github.com/f31da7725306854c719f 4. full stacktrace of all threads: https://gist.github.com/65cbc562b01211ea5612 GlusterFS volume configuration: https://gist.github.com/30f0129d16e25d4a5a52 ganesha.conf: https://gist.github.com/9b5e59b8d6d8cb84c85d How I mount NFS share: === mount -t nfs4 127.0.0.1:/mail_boxes /mnt/tmp -o defaults,_netdev,minorversion=2,noac,noacl,lookupcache=none,timeo=100 === On четвер, 7 січня 2016 р. 12:06:42 EET Soumya Koduri wrote: > Entries_HWMark = 500; ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
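For completeness, the attached per-thread stacktraces can be reproduced with a plain gdb attach to the running daemon (pid lookup is illustrative):

===
gdb -p $(pgrep -x ganesha.nfsd)
(gdb) thread apply all bt        # short stacktrace of all threads
(gdb) thread apply all bt full   # full stacktrace of all threads
===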
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
OK, here is valgrind log of patched Ganesha (I took recent version of your patchset, 8685abfc6d) with Entries_HWMARK set to 500. https://gist.github.com/5397c152a259b9600af0 See no huge runtime leaks now. However, I've repeated this test with another volume in replica and got the following Ganesha error: === ganesha.nfsd: inode.c:716: __inode_forget: Assertion `inode->nlookup >= nlookup' failed. === 06.01.2016 08:40, Soumya Koduri написав: On 01/06/2016 03:53 AM, Oleksandr Natalenko wrote: OK, I've repeated the same traversing test with patched GlusterFS API, and here is new Valgrind log: https://gist.github.com/17ecb16a11c9aed957f5 Fuse mount doesn't use gfapi helper. Does your above GlusterFS API application call glfs_fini() during exit? glfs_fini() is responsible for freeing the memory consumed by gfAPI applications. Could you repeat the test with nfs-ganesha (which for sure calls glfs_fini() and purges inodes if exceeds its inode cache limit) if possible. Thanks, Soumya Still leaks. On вівторок, 5 січня 2016 р. 22:52:25 EET Soumya Koduri wrote: On 01/05/2016 05:56 PM, Oleksandr Natalenko wrote: Unfortunately, both patches didn't make any difference for me. I've patched 3.7.6 with both patches, recompiled and installed patched GlusterFS package on client side and mounted volume with ~2M of files. The I performed usual tree traverse with simple "find". Memory RES value went from ~130M at the moment of mounting to ~1.5G after traversing the volume for ~40 mins. Valgrind log still shows lots of leaks. Here it is: https://gist.github.com/56906ca6e657c4ffa4a1 Looks like you had done fuse mount. The patches which I have pasted below apply to gfapi/nfs-ganesha applications. Also, to resolve the nfs-ganesha issue which I had mentioned below (in case if Entries_HWMARK option gets changed), I have posted below fix - https://review.gerrithub.io/#/c/258687 Thanks, Soumya Ideas? 05.01.2016 12:31, Soumya Koduri написав: I tried to debug the inode* related leaks and seen some improvements after applying the below patches when ran the same test (but will smaller load). Could you please apply those patches & confirm the same? a) http://review.gluster.org/13125 This will fix the inodes & their ctx related leaks during unexport and the program exit. Please check the valgrind output after applying the patch. It should not list any inodes related memory as lost. b) http://review.gluster.org/13096 The reason the change in Entries_HWMARK (in your earlier mail) dint have much effect is that the inode_nlookup count doesn't become zero for those handles/inodes being closed by ganesha. Hence those inodes shall get added to inode lru list instead of purge list which shall get forcefully purged only when the number of gfapi inode table entries reaches its limit (which is 137012). This patch fixes those 'nlookup' counts. Please apply this patch and reduce 'Entries_HWMARK' to much lower value and check if it decreases the in-memory being consumed by ganesha process while being active. CACHEINODE { Entries_HWMark = 500; } Note: I see an issue with nfs-ganesha during exit when the option 'Entries_HWMARK' gets changed. This is not related to any of the above patches (or rather Gluster) and I am currently debugging it. Thanks, Soumya On 12/25/2015 11:34 PM, Oleksandr Natalenko wrote: 1. test with Cache_Size = 256 and Entries_HWMark = 4096 Before find . -type f: root 3120 0.6 11.0 879120 208408 ? 
Ssl 17:39 0:00 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT After: root 3120 11.4 24.3 1170076 458168 ? Ssl 17:39 13:39 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT ~250M leak. 2. test with default values (after ganesha restart) Before: root 24937 1.3 10.4 875016 197808 ? Ssl 19:39 0:00 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT After: root 24937 3.5 18.9 1022544 356340 ? Ssl 19:39 0:40 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT ~159M leak. No reasonable correlation detected. Second test was finished much faster than first (I guess, server-side GlusterFS cache or server kernel page cache is the cause). There are ~1.8M files on this test volume. On пʼятниця, 25 грудня 2015 р. 20:28:13 EET Soumya Koduri wrote: On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote: Another addition: it seems to be GlusterFS API library memory leak because NFS-Ganesha also consumes huge amount of memory while doing ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory usage: === root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
OK, I've repeated the same traversing test with patched GlusterFS API, and here is new Valgrind log: https://gist.github.com/17ecb16a11c9aed957f5 Still leaks. On вівторок, 5 січня 2016 р. 22:52:25 EET Soumya Koduri wrote: > On 01/05/2016 05:56 PM, Oleksandr Natalenko wrote: > > Unfortunately, both patches didn't make any difference for me. > > > > I've patched 3.7.6 with both patches, recompiled and installed patched > > GlusterFS package on client side and mounted volume with ~2M of files. > > The I performed usual tree traverse with simple "find". > > > > Memory RES value went from ~130M at the moment of mounting to ~1.5G > > after traversing the volume for ~40 mins. Valgrind log still shows lots > > of leaks. Here it is: > > > > https://gist.github.com/56906ca6e657c4ffa4a1 > > Looks like you had done fuse mount. The patches which I have pasted > below apply to gfapi/nfs-ganesha applications. > > Also, to resolve the nfs-ganesha issue which I had mentioned below (in > case if Entries_HWMARK option gets changed), I have posted below fix - > https://review.gerrithub.io/#/c/258687 > > Thanks, > Soumya > > > Ideas? > > > > 05.01.2016 12:31, Soumya Koduri написав: > >> I tried to debug the inode* related leaks and seen some improvements > >> after applying the below patches when ran the same test (but will > >> smaller load). Could you please apply those patches & confirm the > >> same? > >> > >> a) http://review.gluster.org/13125 > >> > >> This will fix the inodes & their ctx related leaks during unexport and > >> the program exit. Please check the valgrind output after applying the > >> patch. It should not list any inodes related memory as lost. > >> > >> b) http://review.gluster.org/13096 > >> > >> The reason the change in Entries_HWMARK (in your earlier mail) dint > >> have much effect is that the inode_nlookup count doesn't become zero > >> for those handles/inodes being closed by ganesha. Hence those inodes > >> shall get added to inode lru list instead of purge list which shall > >> get forcefully purged only when the number of gfapi inode table > >> entries reaches its limit (which is 137012). > >> > >> This patch fixes those 'nlookup' counts. Please apply this patch and > >> reduce 'Entries_HWMARK' to much lower value and check if it decreases > >> the in-memory being consumed by ganesha process while being active. > >> > >> CACHEINODE { > >> > >> Entries_HWMark = 500; > >> > >> } > >> > >> > >> Note: I see an issue with nfs-ganesha during exit when the option > >> 'Entries_HWMARK' gets changed. This is not related to any of the above > >> patches (or rather Gluster) and I am currently debugging it. > >> > >> Thanks, > >> Soumya > >> > >> On 12/25/2015 11:34 PM, Oleksandr Natalenko wrote: > >>> 1. test with Cache_Size = 256 and Entries_HWMark = 4096 > >>> > >>> Before find . -type f: > >>> > >>> root 3120 0.6 11.0 879120 208408 ? Ssl 17:39 0:00 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> After: > >>> > >>> root 3120 11.4 24.3 1170076 458168 ? Ssl 17:39 13:39 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> ~250M leak. > >>> > >>> 2. test with default values (after ganesha restart) > >>> > >>> Before: > >>> > >>> root 24937 1.3 10.4 875016 197808 ? Ssl 19:39 0:00 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> After: > >>> > >>> root 24937 3.5 18.9 1022544 356340 ? 
Ssl 19:39 0:40 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> ~159M leak. > >>> > >>> No reasonable correlation detected. Second test was finished much > >>> faster than > >>> first (I guess, server-side GlusterFS cache or server kernel page > >>> cache
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Correct, I used FUSE mount. Shouldn't gfapi be used by FUSE mount helper (/ usr/bin/glusterfs)? On вівторок, 5 січня 2016 р. 22:52:25 EET Soumya Koduri wrote: > On 01/05/2016 05:56 PM, Oleksandr Natalenko wrote: > > Unfortunately, both patches didn't make any difference for me. > > > > I've patched 3.7.6 with both patches, recompiled and installed patched > > GlusterFS package on client side and mounted volume with ~2M of files. > > The I performed usual tree traverse with simple "find". > > > > Memory RES value went from ~130M at the moment of mounting to ~1.5G > > after traversing the volume for ~40 mins. Valgrind log still shows lots > > of leaks. Here it is: > > > > https://gist.github.com/56906ca6e657c4ffa4a1 > > Looks like you had done fuse mount. The patches which I have pasted > below apply to gfapi/nfs-ganesha applications. > > Also, to resolve the nfs-ganesha issue which I had mentioned below (in > case if Entries_HWMARK option gets changed), I have posted below fix - > https://review.gerrithub.io/#/c/258687 > > Thanks, > Soumya > > > Ideas? > > > > 05.01.2016 12:31, Soumya Koduri написав: > >> I tried to debug the inode* related leaks and seen some improvements > >> after applying the below patches when ran the same test (but will > >> smaller load). Could you please apply those patches & confirm the > >> same? > >> > >> a) http://review.gluster.org/13125 > >> > >> This will fix the inodes & their ctx related leaks during unexport and > >> the program exit. Please check the valgrind output after applying the > >> patch. It should not list any inodes related memory as lost. > >> > >> b) http://review.gluster.org/13096 > >> > >> The reason the change in Entries_HWMARK (in your earlier mail) dint > >> have much effect is that the inode_nlookup count doesn't become zero > >> for those handles/inodes being closed by ganesha. Hence those inodes > >> shall get added to inode lru list instead of purge list which shall > >> get forcefully purged only when the number of gfapi inode table > >> entries reaches its limit (which is 137012). > >> > >> This patch fixes those 'nlookup' counts. Please apply this patch and > >> reduce 'Entries_HWMARK' to much lower value and check if it decreases > >> the in-memory being consumed by ganesha process while being active. > >> > >> CACHEINODE { > >> > >> Entries_HWMark = 500; > >> > >> } > >> > >> > >> Note: I see an issue with nfs-ganesha during exit when the option > >> 'Entries_HWMARK' gets changed. This is not related to any of the above > >> patches (or rather Gluster) and I am currently debugging it. > >> > >> Thanks, > >> Soumya > >> > >> On 12/25/2015 11:34 PM, Oleksandr Natalenko wrote: > >>> 1. test with Cache_Size = 256 and Entries_HWMark = 4096 > >>> > >>> Before find . -type f: > >>> > >>> root 3120 0.6 11.0 879120 208408 ? Ssl 17:39 0:00 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> After: > >>> > >>> root 3120 11.4 24.3 1170076 458168 ? Ssl 17:39 13:39 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> ~250M leak. > >>> > >>> 2. test with default values (after ganesha restart) > >>> > >>> Before: > >>> > >>> root 24937 1.3 10.4 875016 197808 ? Ssl 19:39 0:00 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> After: > >>> > >>> root 24937 3.5 18.9 1022544 356340 ? 
Ssl 19:39 0:40 > >>> /usr/bin/ > >>> ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N > >>> NIV_EVENT > >>> > >>> ~159M leak. > >>> > >>> No reasonable correlation detected. Second test was finished much > >>> faster than > >>> first (I guess, server-side GlusterFS cache or server kernel page > >>> cache is the > >>> cause). > >>>
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Unfortunately, both patches didn't make any difference for me. I've patched 3.7.6 with both patches, recompiled and installed patched GlusterFS package on client side and mounted volume with ~2M of files. The I performed usual tree traverse with simple "find". Memory RES value went from ~130M at the moment of mounting to ~1.5G after traversing the volume for ~40 mins. Valgrind log still shows lots of leaks. Here it is: https://gist.github.com/56906ca6e657c4ffa4a1 Ideas? 05.01.2016 12:31, Soumya Koduri написав: I tried to debug the inode* related leaks and seen some improvements after applying the below patches when ran the same test (but will smaller load). Could you please apply those patches & confirm the same? a) http://review.gluster.org/13125 This will fix the inodes & their ctx related leaks during unexport and the program exit. Please check the valgrind output after applying the patch. It should not list any inodes related memory as lost. b) http://review.gluster.org/13096 The reason the change in Entries_HWMARK (in your earlier mail) dint have much effect is that the inode_nlookup count doesn't become zero for those handles/inodes being closed by ganesha. Hence those inodes shall get added to inode lru list instead of purge list which shall get forcefully purged only when the number of gfapi inode table entries reaches its limit (which is 137012). This patch fixes those 'nlookup' counts. Please apply this patch and reduce 'Entries_HWMARK' to much lower value and check if it decreases the in-memory being consumed by ganesha process while being active. CACHEINODE { Entries_HWMark = 500; } Note: I see an issue with nfs-ganesha during exit when the option 'Entries_HWMARK' gets changed. This is not related to any of the above patches (or rather Gluster) and I am currently debugging it. Thanks, Soumya On 12/25/2015 11:34 PM, Oleksandr Natalenko wrote: 1. test with Cache_Size = 256 and Entries_HWMark = 4096 Before find . -type f: root 3120 0.6 11.0 879120 208408 ? Ssl 17:39 0:00 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT After: root 3120 11.4 24.3 1170076 458168 ? Ssl 17:39 13:39 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT ~250M leak. 2. test with default values (after ganesha restart) Before: root 24937 1.3 10.4 875016 197808 ? Ssl 19:39 0:00 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT After: root 24937 3.5 18.9 1022544 356340 ? Ssl 19:39 0:40 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT ~159M leak. No reasonable correlation detected. Second test was finished much faster than first (I guess, server-side GlusterFS cache or server kernel page cache is the cause). There are ~1.8M files on this test volume. On пʼятниця, 25 грудня 2015 р. 20:28:13 EET Soumya Koduri wrote: On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote: Another addition: it seems to be GlusterFS API library memory leak because NFS-Ganesha also consumes huge amount of memory while doing ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory usage: === root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT === 1.4G is too much for simple stat() :(. Ideas? nfs-ganesha also has cache layer which can scale to millions of entries depending on the number of files/directories being looked upon. However there are parameters to tune it. 
So either try stat with few entries or add below block in nfs-ganesha.conf file, set low limits and check the difference. That may help us narrow down how much memory actually consumed by core nfs-ganesha and gfAPI. CACHEINODE { Cache_Size(uint32, range 1 to UINT32_MAX, default 32633); # cache size Entries_HWMark(uint32, range 1 to UINT32_MAX, default 10); #Max no. of entries in the cache. } Thanks, Soumya 24.12.2015 16:32, Oleksandr Natalenko написав: Still actual issue for 3.7.6. Any suggestions? 24.09.2015 10:14, Oleksandr Natalenko написав: In our GlusterFS deployment we've encountered something like memory leak in GlusterFS FUSE client. We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, maildir format). Here is inode stats for both bricks and mountpoint: === Brick 1 (Server 1): Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 5678132262% /bricks/r6sdLV08_vd1_mail Brick 2 (Server 2): Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/vg_vd0_misc-lv07_mail 578767984
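A small sketch that may help with the before/after numbers quoted above: sampling the resident size of ganesha.nfsd periodically while the traversal runs, instead of eyeballing ps output. The pgrep pattern and interval are only illustrative:
===
# Sample the RSS of ganesha.nfsd once a minute while the test runs.
PID=$(pgrep -o ganesha.nfsd)
while kill -0 "$PID" 2>/dev/null; do
    grep VmRSS "/proc/$PID/status"
    sleep 60
done
===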
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Here is another Valgrind log of a similar scenario, but with drop_caches before umount: https://gist.github.com/06997ecc8c7bce83aec1 Also, I've tried to drop caches on a production VM with a GlusterFS volume mounted and leaking memory for several weeks, with absolutely no effect: === root 945 0.1 48.2 1273900 739244 ? Ssl 2015 58:54 /usr/sbin/glusterfs --volfile-server=server.example.com --volfile-id=volume /mnt/volume === The numbers above stayed the same before the drop as well as 5 minutes after it. On Sunday, 3 January 2016, 13:35:51 EET Vijay Bellur wrote: > /proc/sys/vm/drop_caches ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
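For reference, a minimal sketch of the drop-caches check described above, run as root on the client (the 5-minute wait mirrors the test; nothing else is assumed):
===
# RSS of the glusterfs client before the drop, after syncing and
# dropping kernel caches, and again 5 minutes later.
ps -C glusterfs -o pid=,rss=
sync
echo 3 > /proc/sys/vm/drop_caches
sleep 300
ps -C glusterfs -o pid=,rss=
===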
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Another Valgrind run. I did the following: === valgrind --leak-check=full --show-leak-kinds=all --log- file="valgrind_fuse.log" /usr/bin/glusterfs -N --volfile- server=some.server.com --volfile-id=somevolume /mnt/volume === then cd to /mnt/volume and find . -type f. After traversing some part of hierarchy I've stopped find and did umount /mnt/volume. Here is valgrind_fuse.log file: https://gist.github.com/7e2679e1e72e48f75a2b On четвер, 31 грудня 2015 р. 14:09:03 EET Soumya Koduri wrote: > On 12/28/2015 02:32 PM, Soumya Koduri wrote: > > - Original Message - > > > >> From: "Pranith Kumar Karampuri" > >> To: "Oleksandr Natalenko" , "Soumya Koduri" > >> Cc: gluster-us...@gluster.org, > >> gluster-devel@gluster.org > >> Sent: Monday, December 28, 2015 9:32:07 AM > >> Subject: Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS > >> FUSE client>> > >> On 12/26/2015 04:45 AM, Oleksandr Natalenko wrote: > >>> Also, here is valgrind output with our custom tool, that does GlusterFS > >>> volume > >>> traversing (with simple stats) just like find tool. In this case > >>> NFS-Ganesha > >>> is not used. > >>> > >>> https://gist.github.com/e4602a50d3c98f7a2766 > >> > >> hi Oleksandr, > >> > >> I went through the code. Both NFS Ganesha and the custom tool use > >> > >> gfapi and the leak is stemming from that. I am not very familiar with > >> this part of code but there seems to be one inode_unref() that is > >> missing in failure path of resolution. Not sure if that is corresponding > >> to the leaks. > >> > >> Soumya, > >> > >> Could this be the issue? review.gluster.org seems to be down. So > >> > >> couldn't send the patch. Please ping me on IRC. > >> diff --git a/api/src/glfs-resolve.c b/api/src/glfs-resolve.c > >> index b5efcba..52b538b 100644 > >> --- a/api/src/glfs-resolve.c > >> +++ b/api/src/glfs-resolve.c > >> @@ -467,9 +467,11 @@ priv_glfs_resolve_at (struct glfs *fs, xlator_t > >> *subvol, inode_t *at, > >> > >> } > >> > >> } > >> > >> - if (parent && next_component) > >> + if (parent && next_component) { > >> + inode_unref (parent); > >> + parent = NULL; > >> > >> /* resolution failed mid-way */ > >> goto out; > >> > >> +} > >> > >> /* At this point, all components up to the last parent > >> directory > >> > >> have been resolved successfully (@parent). Resolution of > >> > >> basename > > > > yes. This could be one of the reasons. There are few leaks with respect to > > inode references in gfAPI. See below. > > > > > > On GlusterFS side, looks like majority of the leaks are related to inodes > > and their contexts. Possible reasons which I can think of are: > > > > 1) When there is a graph switch, old inode table and their entries are not > > purged (this is a known issue). There was an effort put to fix this > > issue. But I think it had other side-effects and hence not been applied. > > Maybe we should revive those changes again. > > > > 2) With regard to above, old entries can be purged in case if any request > > comes with the reference to old inode (as part of 'glfs_resolve_inode'), > > provided their reference counts are properly decremented. But this is not > > happening at the moment in gfapi. > > > > 3) Applications should hold and release their reference as needed and > > required. 
There are certain fixes needed in this area as well (including > > the fix provided by Pranith above).> > > From code-inspection, have made changes to fix few leaks of case (2) & > > (3) with respect to gfAPI.> > > http://review.gluster.org/#/c/13096 (yet to test the changes) > > > > I haven't yet narrowed down any suspects pertaining to only NFS-Ganesha. > > Will re-check and update. > I tried similar tests but with smaller set of files. I could see the > inode_ctx leak even without graph switches involved. I suspect that > could be because valgrind checks for memory leaks during the exit of the > program. We call 'glfs_fini()' to cleanup the memory being used b
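Since several valgrind logs are being compared in this thread, a quick way to pull out just the leak summaries from the files mentioned above (a sketch; adjust the file names as needed):
===
# Show the leak summary block and the definitely-lost totals only.
grep -A6 "LEAK SUMMARY" valgrind_fuse.log
grep "definitely lost:" valgrind_fuse.log
===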
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Also, here is valgrind output with our custom tool, that does GlusterFS volume traversing (with simple stats) just like find tool. In this case NFS-Ganesha is not used. https://gist.github.com/e4602a50d3c98f7a2766 One may see GlusterFS-related leaks here as well. On пʼятниця, 25 грудня 2015 р. 20:28:13 EET Soumya Koduri wrote: > On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote: > > Another addition: it seems to be GlusterFS API library memory leak > > because NFS-Ganesha also consumes huge amount of memory while doing > > ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory > > usage: > > > > === > > root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 > > /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f > > /etc/ganesha/ganesha.conf -N NIV_EVENT > > === > > > > 1.4G is too much for simple stat() :(. > > > > Ideas? > > nfs-ganesha also has cache layer which can scale to millions of entries > depending on the number of files/directories being looked upon. However > there are parameters to tune it. So either try stat with few entries or > add below block in nfs-ganesha.conf file, set low limits and check the > difference. That may help us narrow down how much memory actually > consumed by core nfs-ganesha and gfAPI. > > CACHEINODE { > Cache_Size(uint32, range 1 to UINT32_MAX, default 32633); # cache size > Entries_HWMark(uint32, range 1 to UINT32_MAX, default 10); #Max no. > of entries in the cache. > } > > Thanks, > Soumya > > > 24.12.2015 16:32, Oleksandr Natalenko написав: > >> Still actual issue for 3.7.6. Any suggestions? > >> > >> 24.09.2015 10:14, Oleksandr Natalenko написав: > >>> In our GlusterFS deployment we've encountered something like memory > >>> leak in GlusterFS FUSE client. > >>> > >>> We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, > >>> maildir format). Here is inode stats for both bricks and mountpoint: > >>> > >>> === > >>> Brick 1 (Server 1): > >>> > >>> Filesystem InodesIUsed > >>> > >>> IFree IUse% Mounted on > >>> > >>> /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 > >>> > >>> 5678132262% /bricks/r6sdLV08_vd1_mail > >>> > >>> Brick 2 (Server 2): > >>> > >>> Filesystem InodesIUsed > >>> > >>> IFree IUse% Mounted on > >>> > >>> /dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 > >>> > >>> 5678130712% /bricks/r6sdLV07_vd0_mail > >>> > >>> Mountpoint (Server 3): > >>> > >>> Filesystem InodesIUsed IFree > >>> IUse% Mounted on > >>> glusterfs.xxx:mail 578767760 10954915 567812845 > >>> 2% /var/spool/mail/virtual > >>> === > >>> > >>> glusterfs.xxx domain has two A records for both Server 1 and Server 2. 
> >>> > >>> Here is volume info: > >>> > >>> === > >>> Volume Name: mail > >>> Type: Replicate > >>> Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2 > >>> Status: Started > >>> Number of Bricks: 1 x 2 = 2 > >>> Transport-type: tcp > >>> Bricks: > >>> Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail > >>> Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail > >>> Options Reconfigured: > >>> nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24 > >>> features.cache-invalidation-timeout: 10 > >>> performance.stat-prefetch: off > >>> performance.quick-read: on > >>> performance.read-ahead: off > >>> performance.flush-behind: on > >>> performance.write-behind: on > >>> performance.io-thread-count: 4 > >>> performance.cache-max-file-size: 1048576 > >>> performance.cache-size: 67108864 > >>> performance.readdir-ahead: off > >>> === > >>> > >>> Soon enough after mounting and exim/dovecot start, glusterfs client > >>> process begins to consume huge amount of RAM: > >>> > >>> === > >>> user@server3 ~$ ps aux | grep glusterfs | grep mail > >>> root 28895 14.4 15.0 15510324 14908868 ? Ssl Sep03
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
OK, I've rebuild GlusterFS v3.7.6 with debug enabled as well as NFS-Ganesha with debug enabled as well (and libc allocator). Here is my test steps: 1. launch nfs-ganesha: valgrind --leak-check=full --show-leak-kinds=all --log-file="valgrind.log" / opt/nfs-ganesha/bin/ganesha.nfsd -F -L ./ganesha.log -f ./ganesha.conf -N NIV_EVENT 2. mount NFS share: mount -t nfs4 127.0.0.1:/share share -o defaults,_netdev,minorversion=2,noac,noacl,lookupcache=none,timeo=100 3. cd to share and run find . for some time 4. CTRL+C find, unmount share. 5. CTRL+C NFS-Ganesha. Here is full valgrind output: https://gist.github.com/eebd9f94ababd8130d49 One may see the probability of massive leaks at the end of valgrind output related to both GlusterFS and NFS-Ganesha code. On пʼятниця, 25 грудня 2015 р. 23:29:07 EET Soumya Koduri wrote: > On 12/25/2015 08:56 PM, Oleksandr Natalenko wrote: > > What units Cache_Size is measured in? Bytes? > > Its actually (Cache_Size * sizeof_ptr) bytes. If possible, could you > please run ganesha process under valgrind? Will help in detecting leaks. > > Thanks, > Soumya > > > 25.12.2015 16:58, Soumya Koduri написав: > >> On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote: > >>> Another addition: it seems to be GlusterFS API library memory leak > >>> because NFS-Ganesha also consumes huge amount of memory while doing > >>> ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory > >>> usage: > >>> > >>> === > >>> root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 > >>> /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f > >>> /etc/ganesha/ganesha.conf -N NIV_EVENT > >>> === > >>> > >>> 1.4G is too much for simple stat() :(. > >>> > >>> Ideas? > >> > >> nfs-ganesha also has cache layer which can scale to millions of > >> entries depending on the number of files/directories being looked > >> upon. However there are parameters to tune it. So either try stat with > >> few entries or add below block in nfs-ganesha.conf file, set low > >> limits and check the difference. That may help us narrow down how much > >> memory actually consumed by core nfs-ganesha and gfAPI. > >> > >> CACHEINODE { > >> > >> Cache_Size(uint32, range 1 to UINT32_MAX, default 32633); # cache > >> > >> size > >> > >> Entries_HWMark(uint32, range 1 to UINT32_MAX, default 10); #Max > >> > >> no. of entries in the cache. > >> } > >> > >> Thanks, > >> Soumya > >> > >>> 24.12.2015 16:32, Oleksandr Natalenko написав: > >>>> Still actual issue for 3.7.6. Any suggestions? > >>>> > >>>> 24.09.2015 10:14, Oleksandr Natalenko написав: > >>>>> In our GlusterFS deployment we've encountered something like memory > >>>>> leak in GlusterFS FUSE client. > >>>>> > >>>>> We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, > >>>>> maildir format). 
Here is inode stats for both bricks and mountpoint: > >>>>> > >>>>> === > >>>>> Brick 1 (Server 1): > >>>>> > >>>>> Filesystem Inodes IUsed > >>>>> > >>>>> IFree IUse% Mounted on > >>>>> > >>>>> /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 > >>>>> > >>>>> 5678132262% /bricks/r6sdLV08_vd1_mail > >>>>> > >>>>> Brick 2 (Server 2): > >>>>> > >>>>> Filesystem Inodes IUsed > >>>>> > >>>>> IFree IUse% Mounted on > >>>>> > >>>>> /dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 > >>>>> > >>>>> 5678130712% /bricks/r6sdLV07_vd0_mail > >>>>> > >>>>> Mountpoint (Server 3): > >>>>> > >>>>> Filesystem InodesIUsed IFree > >>>>> IUse% Mounted on > >>>>> glusterfs.xxx:mail 578767760 10954915 567812845 > >>>>> 2% /var/spool/mail/virtual > >>>>> === > >>>>> > >>>>> glusterfs.xxx domain ha
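The ganesha.conf used in this test is referenced but not shown; a minimal FSAL_GLUSTER export plus the cache block would look roughly like the sketch below. The hostname, volume name and paths are placeholders, not the actual file used above, and the exact semantics of Path vs. Pseudo should be checked against the nfs-ganesha FSAL_GLUSTER documentation:
===
cat > ./ganesha.conf <<'EOF'
EXPORT {
    Export_Id = 1;
    Path = "/share";
    Pseudo = "/share";
    Access_Type = RW;
    Squash = No_root_squash;
    Protocols = 4;
    FSAL {
        Name = GLUSTER;
        Hostname = "127.0.0.1";
        Volume = "somevolume";
    }
}
CACHEINODE {
    Entries_HWMark = 500;
}
EOF
===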
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
1. test with Cache_Size = 256 and Entries_HWMark = 4096 Before find . -type f: root 3120 0.6 11.0 879120 208408 ? Ssl 17:39 0:00 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT After: root 3120 11.4 24.3 1170076 458168 ? Ssl 17:39 13:39 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT ~250M leak. 2. test with default values (after ganesha restart) Before: root 24937 1.3 10.4 875016 197808 ? Ssl 19:39 0:00 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT After: root 24937 3.5 18.9 1022544 356340 ? Ssl 19:39 0:40 /usr/bin/ ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT ~159M leak. No reasonable correlation detected. Second test was finished much faster than first (I guess, server-side GlusterFS cache or server kernel page cache is the cause). There are ~1.8M files on this test volume. On пʼятниця, 25 грудня 2015 р. 20:28:13 EET Soumya Koduri wrote: > On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote: > > Another addition: it seems to be GlusterFS API library memory leak > > because NFS-Ganesha also consumes huge amount of memory while doing > > ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory > > usage: > > > > === > > root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 > > /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f > > /etc/ganesha/ganesha.conf -N NIV_EVENT > > === > > > > 1.4G is too much for simple stat() :(. > > > > Ideas? > > nfs-ganesha also has cache layer which can scale to millions of entries > depending on the number of files/directories being looked upon. However > there are parameters to tune it. So either try stat with few entries or > add below block in nfs-ganesha.conf file, set low limits and check the > difference. That may help us narrow down how much memory actually > consumed by core nfs-ganesha and gfAPI. > > CACHEINODE { > Cache_Size(uint32, range 1 to UINT32_MAX, default 32633); # cache size > Entries_HWMark(uint32, range 1 to UINT32_MAX, default 10); #Max no. > of entries in the cache. > } > > Thanks, > Soumya > > > 24.12.2015 16:32, Oleksandr Natalenko написав: > >> Still actual issue for 3.7.6. Any suggestions? > >> > >> 24.09.2015 10:14, Oleksandr Natalenko написав: > >>> In our GlusterFS deployment we've encountered something like memory > >>> leak in GlusterFS FUSE client. > >>> > >>> We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, > >>> maildir format). Here is inode stats for both bricks and mountpoint: > >>> > >>> === > >>> Brick 1 (Server 1): > >>> > >>> Filesystem InodesIUsed > >>> > >>> IFree IUse% Mounted on > >>> > >>> /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 > >>> > >>> 5678132262% /bricks/r6sdLV08_vd1_mail > >>> > >>> Brick 2 (Server 2): > >>> > >>> Filesystem InodesIUsed > >>> > >>> IFree IUse% Mounted on > >>> > >>> /dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 > >>> > >>> 5678130712% /bricks/r6sdLV07_vd0_mail > >>> > >>> Mountpoint (Server 3): > >>> > >>> Filesystem InodesIUsed IFree > >>> IUse% Mounted on > >>> glusterfs.xxx:mail 578767760 10954915 567812845 > >>> 2% /var/spool/mail/virtual > >>> === > >>> > >>> glusterfs.xxx domain has two A records for both Server 1 and Server 2. 
> >>> > >>> Here is volume info: > >>> > >>> === > >>> Volume Name: mail > >>> Type: Replicate > >>> Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2 > >>> Status: Started > >>> Number of Bricks: 1 x 2 = 2 > >>> Transport-type: tcp > >>> Bricks: > >>> Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail > >>> Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail > >>> Options Reconfigured: > >>> nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24 > >>> features.cache-invalidation-timeout: 10 > >>> performance.stat-prefetch: off > >>> perfo
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
What units Cache_Size is measured in? Bytes? 25.12.2015 16:58, Soumya Koduri написав: On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote: Another addition: it seems to be GlusterFS API library memory leak because NFS-Ganesha also consumes huge amount of memory while doing ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory usage: === root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT === 1.4G is too much for simple stat() :(. Ideas? nfs-ganesha also has cache layer which can scale to millions of entries depending on the number of files/directories being looked upon. However there are parameters to tune it. So either try stat with few entries or add below block in nfs-ganesha.conf file, set low limits and check the difference. That may help us narrow down how much memory actually consumed by core nfs-ganesha and gfAPI. CACHEINODE { Cache_Size(uint32, range 1 to UINT32_MAX, default 32633); # cache size Entries_HWMark(uint32, range 1 to UINT32_MAX, default 10); #Max no. of entries in the cache. } Thanks, Soumya 24.12.2015 16:32, Oleksandr Natalenko написав: Still actual issue for 3.7.6. Any suggestions? 24.09.2015 10:14, Oleksandr Natalenko написав: In our GlusterFS deployment we've encountered something like memory leak in GlusterFS FUSE client. We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, maildir format). Here is inode stats for both bricks and mountpoint: === Brick 1 (Server 1): Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 5678132262% /bricks/r6sdLV08_vd1_mail Brick 2 (Server 2): Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 5678130712% /bricks/r6sdLV07_vd0_mail Mountpoint (Server 3): Filesystem InodesIUsed IFree IUse% Mounted on glusterfs.xxx:mail 578767760 10954915 567812845 2% /var/spool/mail/virtual === glusterfs.xxx domain has two A records for both Server 1 and Server 2. Here is volume info: === Volume Name: mail Type: Replicate Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2 Status: Started Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail Options Reconfigured: nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24 features.cache-invalidation-timeout: 10 performance.stat-prefetch: off performance.quick-read: on performance.read-ahead: off performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 4 performance.cache-max-file-size: 1048576 performance.cache-size: 67108864 performance.readdir-ahead: off === Soon enough after mounting and exim/dovecot start, glusterfs client process begins to consume huge amount of RAM: === user@server3 ~$ ps aux | grep glusterfs | grep mail root 28895 14.4 15.0 15510324 14908868 ? Ssl Sep03 4310:05 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.xxx --volfile-id=mail /var/spool/mail/virtual === That is, ~15 GiB of RAM. Also we've tried to use mountpoint withing separate KVM VM with 2 or 3 GiB of RAM, and soon after starting mail daemons got OOM killer for glusterfs client process. Mounting same share via NFS works just fine. Also, we have much less iowait and loadavg on client side with NFS. Also, we've tried to change IO threads count and cache size in order to limit memory usage with no luck. 
As you can see, total cache size is 4×64==256 MiB (compare to 15 GiB). Enabling-disabling stat-prefetch, read-ahead and readdir-ahead didn't help as well. Here are volume memory stats: === Memory status for volume : mail -- Brick : server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Mallinfo Arena: 36859904 Ordblks : 10357 Smblks : 519 Hblks: 21 Hblkhd : 30515200 Usmblks : 0 Fsmblks : 53440 Uordblks : 18604144 Fordblks : 18255760 Keepcost : 114112 Mempool Stats - NameHotCount ColdCount PaddedSizeof AllocCount MaxAlloc Misses Max-StdAlloc - -- mail-server:fd_t 0 1024 108 30773120 13700 mail-server:dentry_t 16110 274 84 23567614816384 1106499 1152 mail-server:inode_t1636321 156 23721687616384 1876651 1169 mail-trash:fd_t0 1024 108 0000 mail-trash
Re: [Gluster-devel] Memory leak in GlusterFS FUSE client
Another addition: it seems to be GlusterFS API library memory leak because NFS-Ganesha also consumes huge amount of memory while doing ordinary "find . -type f" via NFSv4.2 on remote client. Here is memory usage: === root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT === 1.4G is too much for simple stat() :(. Ideas? 24.12.2015 16:32, Oleksandr Natalenko написав: Still actual issue for 3.7.6. Any suggestions? 24.09.2015 10:14, Oleksandr Natalenko написав: In our GlusterFS deployment we've encountered something like memory leak in GlusterFS FUSE client. We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, maildir format). Here is inode stats for both bricks and mountpoint: === Brick 1 (Server 1): Filesystem InodesIUsed IFree IUse% Mounted on /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 5678132262% /bricks/r6sdLV08_vd1_mail Brick 2 (Server 2): Filesystem InodesIUsed IFree IUse% Mounted on /dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 5678130712% /bricks/r6sdLV07_vd0_mail Mountpoint (Server 3): Filesystem InodesIUsed IFree IUse% Mounted on glusterfs.xxx:mail 578767760 10954915 567812845 2% /var/spool/mail/virtual === glusterfs.xxx domain has two A records for both Server 1 and Server 2. Here is volume info: === Volume Name: mail Type: Replicate Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2 Status: Started Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail Options Reconfigured: nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24 features.cache-invalidation-timeout: 10 performance.stat-prefetch: off performance.quick-read: on performance.read-ahead: off performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 4 performance.cache-max-file-size: 1048576 performance.cache-size: 67108864 performance.readdir-ahead: off === Soon enough after mounting and exim/dovecot start, glusterfs client process begins to consume huge amount of RAM: === user@server3 ~$ ps aux | grep glusterfs | grep mail root 28895 14.4 15.0 15510324 14908868 ? Ssl Sep03 4310:05 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.xxx --volfile-id=mail /var/spool/mail/virtual === That is, ~15 GiB of RAM. Also we've tried to use mountpoint withing separate KVM VM with 2 or 3 GiB of RAM, and soon after starting mail daemons got OOM killer for glusterfs client process. Mounting same share via NFS works just fine. Also, we have much less iowait and loadavg on client side with NFS. Also, we've tried to change IO threads count and cache size in order to limit memory usage with no luck. As you can see, total cache size is 4×64==256 MiB (compare to 15 GiB). Enabling-disabling stat-prefetch, read-ahead and readdir-ahead didn't help as well. 
Here are volume memory stats: === Memory status for volume : mail -- Brick : server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Mallinfo Arena: 36859904 Ordblks : 10357 Smblks : 519 Hblks: 21 Hblkhd : 30515200 Usmblks : 0 Fsmblks : 53440 Uordblks : 18604144 Fordblks : 18255760 Keepcost : 114112 Mempool Stats - NameHotCount ColdCount PaddedSizeof AllocCount MaxAlloc Misses Max-StdAlloc - -- mail-server:fd_t 0 1024 108 30773120 13700 mail-server:dentry_t 16110 274 84 23567614816384 1106499 1152 mail-server:inode_t1636321 156 23721687616384 1876651 1169 mail-trash:fd_t0 1024 108 0000 mail-trash:dentry_t0 32768 84 0000 mail-trash:inode_t 4 32764 156 4400 mail-trash:trash_local_t 064 8628 0000 mail-changetimerecorder:gf_ctr_local_t 064 16540 0000 mail-changelog:rpcsvc_request_t 0 8 2828 0000 mail-changelog:changelog_local_t 064 116 0000 mail-bitrot-stub:br_stub_local_t 0 512 84 79204400 mail-locks:pl_local_t 032
Re: [Gluster-devel] Memory leak in GlusterFS FUSE client
Still actual issue for 3.7.6. Any suggestions? 24.09.2015 10:14, Oleksandr Natalenko написав: In our GlusterFS deployment we've encountered something like memory leak in GlusterFS FUSE client. We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, maildir format). Here is inode stats for both bricks and mountpoint: === Brick 1 (Server 1): Filesystem InodesIUsed IFree IUse% Mounted on /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 5678132262% /bricks/r6sdLV08_vd1_mail Brick 2 (Server 2): Filesystem InodesIUsed IFree IUse% Mounted on /dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 5678130712% /bricks/r6sdLV07_vd0_mail Mountpoint (Server 3): Filesystem InodesIUsed IFree IUse% Mounted on glusterfs.xxx:mail 578767760 10954915 567812845 2% /var/spool/mail/virtual === glusterfs.xxx domain has two A records for both Server 1 and Server 2. Here is volume info: === Volume Name: mail Type: Replicate Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2 Status: Started Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail Options Reconfigured: nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24 features.cache-invalidation-timeout: 10 performance.stat-prefetch: off performance.quick-read: on performance.read-ahead: off performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 4 performance.cache-max-file-size: 1048576 performance.cache-size: 67108864 performance.readdir-ahead: off === Soon enough after mounting and exim/dovecot start, glusterfs client process begins to consume huge amount of RAM: === user@server3 ~$ ps aux | grep glusterfs | grep mail root 28895 14.4 15.0 15510324 14908868 ? Ssl Sep03 4310:05 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.xxx --volfile-id=mail /var/spool/mail/virtual === That is, ~15 GiB of RAM. Also we've tried to use mountpoint withing separate KVM VM with 2 or 3 GiB of RAM, and soon after starting mail daemons got OOM killer for glusterfs client process. Mounting same share via NFS works just fine. Also, we have much less iowait and loadavg on client side with NFS. Also, we've tried to change IO threads count and cache size in order to limit memory usage with no luck. As you can see, total cache size is 4×64==256 MiB (compare to 15 GiB). Enabling-disabling stat-prefetch, read-ahead and readdir-ahead didn't help as well. 
Here are volume memory stats: === Memory status for volume : mail -- Brick : server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Mallinfo Arena: 36859904 Ordblks : 10357 Smblks : 519 Hblks: 21 Hblkhd : 30515200 Usmblks : 0 Fsmblks : 53440 Uordblks : 18604144 Fordblks : 18255760 Keepcost : 114112 Mempool Stats - NameHotCount ColdCount PaddedSizeof AllocCount MaxAlloc Misses Max-StdAlloc - -- mail-server:fd_t 0 1024 108 30773120 13700 mail-server:dentry_t 16110 274 84 23567614816384 1106499 1152 mail-server:inode_t1636321 156 23721687616384 1876651 1169 mail-trash:fd_t0 1024 108 0000 mail-trash:dentry_t0 32768 84 0000 mail-trash:inode_t 4 32764 156 4400 mail-trash:trash_local_t 064 8628 0000 mail-changetimerecorder:gf_ctr_local_t 064 16540 0000 mail-changelog:rpcsvc_request_t 0 8 2828 0000 mail-changelog:changelog_local_t 064 116 0000 mail-bitrot-stub:br_stub_local_t 0 512 84 79204400 mail-locks:pl_local_t 032 148 6812757400 mail-upcall:upcall_local_t 0 512 108 0000 mail-marker:marker_local_t 0 128 332 64980300 mail-quota:quota_local_t 064 476 0000 mail-server:rpcsvc_request_t 0 512 2828 45462533 3400 glusterfs:struct saved_frame 0
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
Here are two consecutive statedumps of the memory usage of the brick in question [1] [2]. The glusterfs client process went from ~630 MB to ~1350 MB of memory usage in less than one hour. Volume options: === cluster.lookup-optimize: on cluster.readdir-optimize: on client.event-threads: 4 network.inode-lru-limit: 4096 server.event-threads: 8 performance.client-io-threads: on storage.linux-aio: on performance.write-behind-window-size: 4194304 performance.stat-prefetch: on performance.quick-read: on performance.read-ahead: on performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 4 performance.cache-max-file-size: 1048576 performance.cache-size: 33554432 performance.readdir-ahead: on === I observe such behavior on similar volumes where millions of files are stored. The volume in question holds ~11M small files (mail storage). So, the memory leak persists. I had to switch to NFS temporarily :(. Any idea? [1] https://gist.github.com/46697b70ffe193fa797e [2] https://gist.github.com/3a968ca909bfdeb31cca 28.09.2015 14:31, Raghavendra Bhat wrote: Hi Oleksandr, You are right. The description should have said it is the limit on the number of inodes in the lru list of the inode cache. I have sent a patch for that. http://review.gluster.org/#/c/12242/ [3] Regards, Raghavendra Bhat On Thu, Sep 24, 2015 at 1:44 PM, Oleksandr Natalenko wrote: I've checked the statedump of the volume in question and haven't found lots of iobuf allocations as mentioned in that bug report. However, I've noticed that there are lots of LRU records like this: === [conn.1.bound_xl./bricks/r6sdLV07_vd0_mail/mail.lru.1] gfid=c4b29310-a19d-451b-8dd1-b3ac2d86b595 nlookup=1 fd-count=0 ref=0 ia_type=1 === In fact, there are 16383 of them. I've checked "gluster volume set help" in order to find something LRU-related and have found this: === Option: network.inode-lru-limit Default Value: 16384 Description: Specifies the maximum megabytes of memory to be used in the inode cache. === Is there an error in the description stating "maximum megabytes of memory"? Shouldn't it mean "maximum number of LRU records"? If not, is it true that the inode cache could grow up to 16 GiB for a client, and one must lower the network.inode-lru-limit value? Another thought: we've enabled write-behind, and the default write-behind-window-size value is 1 MiB. So one may conclude that with lots of small files written, the write-behind buffer could grow up to inode-lru-limit × write-behind-window-size = 16 GiB? Who could explain that to me? 24.09.2015 10:42, Gabi C wrote: oh, my bad... could it be this one? https://bugzilla.redhat.com/show_bug.cgi?id=1126831 [1] [2] Anyway, on oVirt+Gluster I experienced similar behavior... ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel [2] Links: -- [1] https://bugzilla.redhat.com/show_bug.cgi?id=1126831 [2] http://www.gluster.org/mailman/listinfo/gluster-devel [3] http://review.gluster.org/#/c/12242/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
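If the inode LRU list is indeed what keeps the client this large, the options discussed above can be adjusted further and fresh dumps taken for comparison. A sketch using the volume name from this thread (the values are only illustrative):
===
# Lower the client inode LRU limit and the write-behind window.
gluster volume set mail network.inode-lru-limit 1024
gluster volume set mail performance.write-behind-window-size 1MB
# Brick statedumps:
gluster volume statedump mail
# FUSE client statedump: send SIGUSR1 to the client process; the dump
# typically lands under /var/run/gluster/.
kill -USR1 $(pgrep -f 'glusterfs.*volfile-id=mail')
===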
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
I've checked the statedump of the volume in question and haven't found lots of iobuf allocations as mentioned in that bug report. However, I've noticed that there are lots of LRU records like this: === [conn.1.bound_xl./bricks/r6sdLV07_vd0_mail/mail.lru.1] gfid=c4b29310-a19d-451b-8dd1-b3ac2d86b595 nlookup=1 fd-count=0 ref=0 ia_type=1 === In fact, there are 16383 of them. I've checked "gluster volume set help" in order to find something LRU-related and have found this: === Option: network.inode-lru-limit Default Value: 16384 Description: Specifies the maximum megabytes of memory to be used in the inode cache. === Is there an error in the description stating "maximum megabytes of memory"? Shouldn't it mean "maximum number of LRU records"? If not, is it true that the inode cache could grow up to 16 GiB for a client, and one must lower the network.inode-lru-limit value? Another thought: we've enabled write-behind, and the default write-behind-window-size value is 1 MiB. So one may conclude that with lots of small files written, the write-behind buffer could grow up to inode-lru-limit × write-behind-window-size = 16 GiB? Who could explain that to me? 24.09.2015 10:42, Gabi C wrote: oh, my bad... could it be this one? https://bugzilla.redhat.com/show_bug.cgi?id=1126831 [2] Anyway, on oVirt+Gluster I experienced similar behavior... ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
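One way to check whether the 16384 above really counts inodes rather than megabytes is to count the lru records in a statedump, since each cached inode appears as a separate [...lru.N] section like the one quoted. A sketch, assuming the default statedump directory (the file names depend on the setup):
===
# Count LRU inode records in the statedump files.
grep -c '\.lru\.' /var/run/gluster/*.dump.*
===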
Re: [Gluster-devel] [Gluster-users] Memory leak in GlusterFS FUSE client
We use a bare GlusterFS installation with no oVirt involved. 24.09.2015 10:29, Gabi C wrote: Google "vdsm memory leak"... it's been discussed on the list last year and earlier this one... ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Memory leak in GlusterFS FUSE client
In our GlusterFS deployment we've encountered something like memory leak in GlusterFS FUSE client. We use replicated (×2) GlusterFS volume to store mail (exim+dovecot, maildir format). Here is inode stats for both bricks and mountpoint: === Brick 1 (Server 1): Filesystem InodesIUsed IFree IUse% Mounted on /dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 5678132262% /bricks/r6sdLV08_vd1_mail Brick 2 (Server 2): Filesystem InodesIUsed IFree IUse% Mounted on /dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 5678130712% /bricks/r6sdLV07_vd0_mail Mountpoint (Server 3): Filesystem InodesIUsed IFree IUse% Mounted on glusterfs.xxx:mail 578767760 10954915 5678128452% /var/spool/mail/virtual === glusterfs.xxx domain has two A records for both Server 1 and Server 2. Here is volume info: === Volume Name: mail Type: Replicate Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2 Status: Started Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail Options Reconfigured: nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24 features.cache-invalidation-timeout: 10 performance.stat-prefetch: off performance.quick-read: on performance.read-ahead: off performance.flush-behind: on performance.write-behind: on performance.io-thread-count: 4 performance.cache-max-file-size: 1048576 performance.cache-size: 67108864 performance.readdir-ahead: off === Soon enough after mounting and exim/dovecot start, glusterfs client process begins to consume huge amount of RAM: === user@server3 ~$ ps aux | grep glusterfs | grep mail root 28895 14.4 15.0 15510324 14908868 ? Ssl Sep03 4310:05 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.xxx --volfile-id=mail /var/spool/mail/virtual === That is, ~15 GiB of RAM. Also we've tried to use mountpoint withing separate KVM VM with 2 or 3 GiB of RAM, and soon after starting mail daemons got OOM killer for glusterfs client process. Mounting same share via NFS works just fine. Also, we have much less iowait and loadavg on client side with NFS. Also, we've tried to change IO threads count and cache size in order to limit memory usage with no luck. As you can see, total cache size is 4×64==256 MiB (compare to 15 GiB). Enabling-disabling stat-prefetch, read-ahead and readdir-ahead didn't help as well. Here are volume memory stats: === Memory status for volume : mail -- Brick : server1.xxx:/bricks/r6sdLV08_vd1_mail/mail Mallinfo Arena: 36859904 Ordblks : 10357 Smblks : 519 Hblks: 21 Hblkhd : 30515200 Usmblks : 0 Fsmblks : 53440 Uordblks : 18604144 Fordblks : 18255760 Keepcost : 114112 Mempool Stats - NameHotCount ColdCount PaddedSizeof AllocCount MaxAlloc Misses Max-StdAlloc - -- mail-server:fd_t 0 1024 108 30773120 13700 mail-server:dentry_t 16110 274 84 23567614816384 1106499 1152 mail-server:inode_t1636321 156 23721687616384 1876651 1169 mail-trash:fd_t0 1024 108 0000 mail-trash:dentry_t0 32768 84 0000 mail-trash:inode_t 4 32764 156 4400 mail-trash:trash_local_t 064 8628 0000 mail-changetimerecorder:gf_ctr_local_t 06416540 0000 mail-changelog:rpcsvc_request_t 0 8 2828 0000 mail-changelog:changelog_local_t 064 116 0000 mail-bitrot-stub:br_stub_local_t 0 512 84 79204400 mail-locks:pl_local_t 032 148 6812757400 mail-upcall:upcall_local_t 0 512 108 0000 mail-marker:marker_local_t 0 128 332 64980300 mail-quota:quota_local_t 064 476 0000 mail-server:rpcsvc_request_t 0 512 2828 45462533 3400 glusterfs:struct saved_frame
[Gluster-devel] GlusterFS cache architecture
Hello. I'm trying to investigate how GlusterFS manages caching on both the server and the client side, but unfortunately cannot find any exhaustive, appropriate and up-to-date information. The setup is that we have, say, 2 GlusterFS nodes (server_a and server_b) with a replicated volume some_volume. We also have several clients (say, client_1 and client_2) that mount some_volume and manipulate files on it (let's assume some_volume contains web-related assets, and client_1/client_2 are web servers). There is also client_3 that does web-related deployment on some_volume (let's assume that client_3 is a web developer). We would like to use a multilayered cache scheme that involves the filesystem cache (on both client and server sides) as well as the web server cache. So, my questions are: 1) do caching-related options (performance.cache-size, performance.cache-min-file-size, performance.cache-max-file-size etc.) affect the server side only? 2) are there any tunables that affect client-side caching? 3) how is client-side caching performed, if at all (we are talking about the read cache only; the write cache is not interesting to us)? 4) how and in what cases is the client cache discarded (and how does that relate to the upcall framework)? Ideally, there should be some documentation that covers the general GlusterFS cache workflow. Any info would be appreciated. Thanks. -- Oleksandr post-factum Natalenko, MSc pf-kernel community https://natalenko.name/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
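As a starting sketch for question 2 (subject to correction by the developers): the read-side caching translators (io-cache, read-ahead, readdir-ahead, quick-read, md-cache/stat-prefetch) sit in the client graph, so the options below appear to be the client-side knobs, and upcall-based cache invalidation is what question 4 refers to. All values are illustrative only:
===
gluster volume set some_volume performance.cache-size 268435456        # io-cache data cache
gluster volume set some_volume performance.quick-read on               # serve small files from cache
gluster volume set some_volume performance.read-ahead on
gluster volume set some_volume performance.readdir-ahead on
gluster volume set some_volume performance.stat-prefetch on            # md-cache for metadata
gluster volume set some_volume performance.md-cache-timeout 1          # seconds metadata stays valid
gluster volume set some_volume features.cache-invalidation on          # upcall notifications
gluster volume set some_volume features.cache-invalidation-timeout 600
===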