I let the test run a while longer, and it finally tripped the segfault twice. gdb showed:
#0  0x00000000004cdd9d in __glist_add (left=0x0, right=0x7f77110165a8, elt=0x7f772607eb90)
    at /home/keeper/work/ganesha/2.4.1-debug/ganeshabuilder/nfs-ganesha/src/include/gsh_list.h:78
78          left->next = elt;

So both the NFS 4.1 and 4.2 clients tripped the assert and segfault problems.
The totals for the NFS 4.1 client test run were 19 assert failures and
2 segfault failures.

Eric

On Fri, Nov 11, 2016 at 1:58 AM, Eric Eastman <[email protected]> wrote:
> I am about 1/2 way through running the 48 hour test using NFS 4.1 on
> the client, and I have had multiple assert failures caused by
> "refcount = -1", but so far I have not seen any segfault failures, so
> the assert failure is in both 4.1 and 4.2, but not in 4.0. If I don't
> see a segfault failure by tomorrow, I am going to assume the segfault
> failure only happens with 4.2.
>
> Regards,
> Eric
>
>
> On Thu, Nov 10, 2016 at 1:27 PM, Frank Filz <[email protected]> wrote:
>>> I re-ran the same test for 48 hours using the NFS 4.0 mount option, to the
>>> Ganesha NFS 2.4.1 server, with the client NFS fstab entry:
>>>
>>> ede-c2-gw01:/var/top /C2-NFS4 nfs4 rw,hard,noauto,vers=4.0 0 0
>>>
>>> and I have not seen any assert or segfaults, so there is something going on
>>> when using vers=4.2 that is not seen with vers=4.0. When using vers=4.2, I
>>> normally see more than 20 asserts or segfaults per 24 hours when running my
>>> test case.
>>>
>>> I am going to re-run my tests using vers=4.1
>>
>> There has been relatively little 4.2 testing done with Ganesha, so it
>> wouldn't surprise me there is some issue there.
>>
>> If it turns out to be 4.2 only, then we will need to examine what is
>> different in the 4.2 flow.
>>
>> On the other hand, if it shows up in 4.1, then likely culprits are the
>> session code and the way we handle state owner sequence checking (which is
>> for 4.0 only) in conjunction with stateid validation.
>> There's enough complexity in trying to handle the two different ways of
>> validating stateful requests that I could easily see a refcount bug
>> showing up.
>>
>> Frank
>>
>>> On Wed, Nov 2, 2016 at 12:20 PM, Frank Filz <[email protected]>
>>> wrote:
>>> > I'm playing with running Ganesha under valgrind and helgrind to see if
>>> > anything drops out from those.
>>> >
>>> > Unfortunately helgrind seems to show up a lot of data races that
>>> > either have no functional impact (stat collection that doesn't use
>>> > atomic ops), a ton in the ntirpc code, and it also seems to
>>> > misunderstand some atomic ops (I HAVE seen it complain before when
>>> > something is accessed using atomic ops, but sometimes while holding a
>>> > lock, and sometimes not, it decides the fact that there were unlocked
>>> > accesses causes a race even though the atomic op should guarantee).
>>> >
>>> > Frank
>>> >
>>> >> -----Original Message-----
>>> >> From: Malahal Naineni [mailto:[email protected]]
>>> >> Sent: Tuesday, October 25, 2016 11:22 PM
>>> >> To: Eric Eastman <[email protected]>
>>> >> Cc: [email protected]
>>> >> Subject: Re: [Nfs-ganesha-devel] assert in dec_state_owner_ref() with
>>> >> V2.4.0.3
>>> >>
>>> >> Please post if you have an easy reproducer. We will try to recreate
>>> >> and root cause it.
>>> >>
>>> >> On Wed, Oct 26, 2016 at 6:15 AM, Eric Eastman
>>> >> <[email protected]> wrote:
>>> >> > A little more info on this issue. I did a 24 hour run of my test
>>> >> > using the POSIX FSAL with an ext4 file system as the backstore, and
>>> >> > saw 9 asserts during this test run, all caused by the variable
>>> >> > "refcount" ending up at -1. The errors seem to be occurring while
>>> >> > running "rm -rf" on a directory with 1000 sub-directories, with
>>> >> > each having 11 files in it.
>>> >> >
>>> >> > This looks to me like a race condition and I am having issues
>>> >> > finding the root cause reading through the source code.
>>> >> > There are notes from commit e7307c5, dated Jan 5 2016, on "Resolve
>>> >> > race between get_state_owner and dec_state_owner_ref differently"
>>> >> > so this looks like an area where there have been issues before.
>>> >> >
>>> >> > If anyone has an idea on what the root problem is or where to look,
>>> >> > please let me know, as we cannot use Ganesha NFS if it is going to
>>> >> > assert during production.
>>> >> >
>>> >> > Thanks,
>>> >> > Eric
>>> >> >
>>> >> > On Thu, Oct 20, 2016 at 1:22 AM, Eric Eastman
>>> >> > <[email protected]> wrote:
>>> >> >> While testing Ganesha NFS V2.4.0.3 using the CEPH FSAL to a ceph
>>> >> >> file system, I am seeing the ganesha.nfsd process die due to an
>>> >> >> assert call multiple times per hour. I have also seen it die at
>>> >> >> the same place in the code using the VFS FSAL with an ext4 file
>>> >> >> system, but it dies much less often.
>>> >> >>
>>> >> >> It is dying at line 917 in src/SAL/state_misc.c, which is called
>>> >> >> by src/SAL/state_misc.c at line 1010. The assert call is in
>>> >> >> dec_state_owner_ref() at the line:
>>> >> >>
>>> >> >> assert(refcount > 0);
>>> >> >>
>>> >> >> Looking at the core files and adding in some debugging code
>>> >> >> confirms that refcount is -1 when the assert call is made.
>>> >> >>
>>> >> >> It looks like the owner count is trying to go to -1 in
>>> >> >> uncache_nfs4_owner(), but as it occurs only occasionally, I think
>>> >> >> it is a race condition.
>>> >> >>
>>> >> >> Info on the build:
>>> >> >>
>>> >> >> Host OS is Ubuntu 14.04 with a 4.8.2 x86_64 kernel on an
>>> >> >> 8-processor system
>>> >> >>
>>> >> >> Cmake command:
>>> >> >> # cmake -DCMAKE_INSTALL_PREFIX=/opt/keeper -DALLOCATOR=jemalloc
>>> >> >> -DUSE_ADMIN_TOOLS=ON -DUSE_DBUS=ON ../src
>>> >> >>
>>> >> >> # ganesha.nfsd -v
>>> >> >> ganesha.nfsd compiled on Oct 17 2016 at 16:50:18
>>> >> >> Release = V2.4.0.3
>>> >> >> Release comment = GANESHA file server is 64 bits compliant and
>>> >> >> supports NFS v3,4.0,4.1 (pNFS) and 9P
>>> >> >> Git HEAD = 0f55a9a97a4bf232fb0e42542e4ca7491fbf84ce
>>> >> >> Git Describe = V2.4.0.3-0-g0f55a9a
>>> >> >>
>>> >> >> # ceph -v
>>> >> >> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>>> >> >>
>>> >> >> # cat ganesha.conf
>>> >> >> LOG {
>>> >> >>     components {
>>> >> >>         ALL = INFO;
>>> >> >>     }
>>> >> >> }
>>> >> >>
>>> >> >> EXPORT_DEFAULTS {
>>> >> >>     SecType = none, sys;
>>> >> >>     Protocols = 3, 4;
>>> >> >>     Transports = TCP;
>>> >> >> }
>>> >> >>
>>> >> >> # define CephFS export
>>> >> >> EXPORT {
>>> >> >>     Export_ID = 42;
>>> >> >>     Path = /top;
>>> >> >>     Pseudo = /top;
>>> >> >>     Access_Type = RW;
>>> >> >>     Squash = No_Root_Squash;
>>> >> >>     FSAL {
>>> >> >>         Name = CEPH;
>>> >> >>     }
>>> >> >> }
>>> >> >>
>>> >> >> The VFS export for the ext4 tests was:
>>> >> >>
>>> >> >> # define CephFS export
>>> >> >> EXPORT {
>>> >> >>     Export_ID = 43;
>>> >> >>     Path = /var/top;
>>> >> >>     Pseudo = /var/top;
>>> >> >>     Access_Type = RW;
>>> >> >>     Squash = No_Root_Squash;
>>> >> >>     FSAL {
>>> >> >>         Name = VFS;
>>> >> >>     }
>>> >> >> }
>>> >> >>
>>> >> >> The test was 2 Ubuntu 14.04 NFS clients each having 6 processes,
>>> >> >> writing 11,000 256k files in separate directory trees with 11
>>> >> >> files per lowest level node. On each Ubuntu client, 3 processes
>>> >> >> wrote to an NFS 3 mount and 3 wrote to an NFS 4 mount. The files
>>> >> >> are then read and verified, deleted, and the test restarts.
>>> >> >>
>>> >> >> Regards,
>>> >> >> Eric
_______________________________________________
Nfs-ganesha-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
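
A minimal sketch of the failure mode discussed in this thread, for anyone who wants to see it in isolation. This is not Ganesha's code: `demo_owner`, `demo_owner_init()`, `demo_owner_get()` and `demo_owner_put()` are hypothetical names. The only details taken from the thread are that dec_state_owner_ref() asserts `refcount > 0` and that refcount was observed at -1. The sketch assumes the count is decremented atomically and that reaching 0 means the owner is released; one extra release after that point leaves the count at -1.

```c
#include <assert.h>	/* for callers that prefer to trap on underflow */
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a state owner; not Ganesha's actual struct. */
struct demo_owner {
	atomic_int refcount;
	bool freed;	/* set once the last reference is dropped */
};

/* The creator holds the first reference. */
static void demo_owner_init(struct demo_owner *o)
{
	atomic_init(&o->refcount, 1);
	o->freed = false;
}

/* Take one additional reference. */
static void demo_owner_get(struct demo_owner *o)
{
	atomic_fetch_add(&o->refcount, 1);
}

/*
 * Drop one reference. Returns false when the count underflows, which is
 * the condition an assert(refcount > 0) would trap on.
 */
static bool demo_owner_put(struct demo_owner *o)
{
	/* atomic_fetch_sub returns the old value, so subtract 1 again. */
	int refcount = atomic_fetch_sub(&o->refcount, 1) - 1;

	if (refcount == 0) {
		o->freed = true;	/* last reference: owner would be released */
		return true;
	}
	return refcount > 0;
}
```

With one get and two puts the count goes 2, 1, 0 and the owner is released; a third put returns false with the count at -1. In a single thread the extra put is easy to spot. In the server, an imbalance like this would presumably come from two code paths each believing they hold the last reference, which would fit the failures showing up only occasionally.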
