I let the test run a while longer, and it finally tripped the segfault
twice. Gdb showed:

#0  0x00000000004cdd9d in __glist_add (left=0x0, right=0x7f77110165a8,
elt=0x7f772607eb90) at
/home/keeper/work/ganesha/2.4.1-debug/ganeshabuilder/nfs-ganesha/src/include/gsh_list.h:78
78 left->next = elt;
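
For readers following the trace: the fault is the store through a NULL "left" pointer. The sketch below is a simplified model of a circular doubly linked list, not the actual gsh_list.h source; the names list_init, list_insert_between, list_add_tail and list_del are hypothetical. It shows one way "left" can end up NULL, assuming that a node's pointers are cleared when it is unlinked and that the node is later used as a list head again. The NULL check is added only so the sketch reports the condition instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a circular doubly linked list node. */
struct node {
	struct node *next;
	struct node *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

/* Link elt between left and right.
 * Returns -1 instead of dereferencing a NULL neighbour. */
static int list_insert_between(struct node *left, struct node *right,
			       struct node *elt)
{
	if (left == NULL || right == NULL)
		return -1;
	elt->prev = left;
	elt->next = right;
	left->next = elt;	/* the store that faults when left is NULL */
	right->prev = elt;
	return 0;
}

/* Append elt at the tail, i.e. between head->prev and head. */
static int list_add_tail(struct node *head, struct node *elt)
{
	return list_insert_between(head->prev, head, elt);
}

/* Unlink a node and clear its pointers so later misuse is visible. */
static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = NULL;
	n->prev = NULL;
}
```

In this model, calling list_add_tail() on a node that has already been through list_del() passes left == NULL with a non-NULL right, which is the same shape as the arguments in the trace. Whether that is what actually happens in Ganesha is not established here.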

So both the NFS 4.1 and 4.2 clients tripped the assert and segfault
problems.  The totals for the NFS 4.1 client test run were 19 assert
failures and 2 segfault failures.
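
To illustrate the kind of race that could leave a reference count at -1, as in the "refcount = -1" assert discussed in the quoted messages below, here is a minimal sketch using C11 atomics. It is not Ganesha's code: struct owner, owner_try_get and owner_put are hypothetical names, and the sketch checks the value before the decrement rather than after it. The idea is that a lookup must refuse to take a reference once the count has reached zero, otherwise a lookup racing with the final release can lead to one more put than there were gets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical owner object; only the reference count matters here. */
struct owner {
	atomic_int refcount;
};

/* Take a reference only if the object is still live (count > 0). */
static bool owner_try_get(struct owner *o)
{
	int cur = atomic_load(&o->refcount);

	while (cur > 0) {
		/* On failure cur is reloaded with the current value
		 * and the loop condition is checked again. */
		if (atomic_compare_exchange_weak(&o->refcount, &cur, cur + 1))
			return true;
	}
	return false;	/* count already hit zero; caller must not use o */
}

/* Drop a reference; returns true when the caller released the last one. */
static bool owner_put(struct owner *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);	/* a put without a matching get trips here */
	return old == 1;
}
```

With this pattern, a thread that finds the object in a hash table but loses the race with the last owner_put() gets false back from owner_try_get() and can allocate a fresh object instead of reviving one that is being torn down.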

Eric


On Fri, Nov 11, 2016 at 1:58 AM, Eric Eastman
<[email protected]> wrote:
> I am about 1/2 way through running the 48 hour test using NFS 4.1 on
> the client, and I have had multiple assert failures caused by
> "refcount = -1", but so far I have not seen any segfault failures, so
> the assert failure is in both 4.1 and 4.2, but not in 4.0.  If I don't
> see a segfault failure by tomorrow, I am going to assume the segfault
> failure only happens with 4.2.
>
> Regards,
> Eric
>
>
> On Thu, Nov 10, 2016 at 1:27 PM, Frank Filz <[email protected]> wrote:
>>> I re-ran the same test for 48 hours using the NFS 4.0 mount option, to the
>>> Ganesha NFS 2.4.1 server, with the client NFS fstab entry:
>>>
>>> ede-c2-gw01:/var/top /C2-NFS4 nfs4 rw,hard,noauto,vers=4.0  0 0
>>>
>>> and I have not seen any asserts or segfaults, so there is something going on
>>> when using vers=4.2 that is not seen with vers=4.0. When using vers=4.2, I
>>> normally see more than 20 asserts or segfaults per 24 hours when running my
>>> test case.
>>>
>>> I am going to re-run my tests using vers=4.1
>>
>> There has been relatively little 4.2 testing done with Ganesha, so it 
>> wouldn't surprise me there is some issue there.
>>
>> If it turns out to be 4.2 only, then we will need to examine what is 
>> different in the 4.2 flow.
>>
>> On the other hand, if it shows up in 4.1, then likely culprits are the 
>> session code and the way we handle state owner sequence checking (which is 
>> for 4.0 only) in conjunction with stateid validation. There's enough 
>> complexity in trying to handle the two different ways of validating 
>> stateful requests that I could easily see a refcount bug showing up.
>>
>> Frank
>>
>>> On Wed, Nov 2, 2016 at 12:20 PM, Frank Filz <[email protected]>
>>> wrote:
>>> > I'm playing with running Ganesha under valgrind and helgrind to see if
>>> > anything drops out from those.
>>> >
>>> > Unfortunately helgrind seems to report a lot of data races that
>>> > have no functional impact (stat collection that doesn't use
>>> > atomic ops), plus a ton in the ntirpc code, and it also seems to
>>> > misunderstand some atomic ops (I HAVE seen it complain before when
>>> > something is accessed using atomic ops, sometimes while holding a
>>> > lock and sometimes not; it decides that the unlocked accesses
>>> > cause a race even though the atomic op should guarantee atomicity).
>>> >
>>> > Frank
>>> >
>>> >> -----Original Message-----
>>> >> From: Malahal Naineni [mailto:[email protected]]
>>> >> Sent: Tuesday, October 25, 2016 11:22 PM
>>> >> To: Eric Eastman <[email protected]>
>>> >> Cc: [email protected]
>>> >> Subject: Re: [Nfs-ganesha-devel] assert in dec_state_owner_ref() with
>>> >> V2.4.0.3
>>> >>
>>> >> Please post if you have an easy reproducer. We will try to recreate
>>> >> and root cause it.
>>> >>
>>> >> On Wed, Oct 26, 2016 at 6:15 AM, Eric Eastman
>>> >> <[email protected]> wrote:
>>> >> > A little more info on this issue.  I did a 24 hour run of my test
>>> >> > using the POSIX FSAL with an ext4 file system as the backstore, and
>>> >> > saw 9 asserts during this test run, all caused by the variable
>>> >> > "refcount" ending up at -1.  The errors seem to be occurring while
>>> >> > running "rm -rf" on a directory with 1000 sub-directories, with
>>> >> > each having 11 files in it.
>>> >> >
>>> >> > This looks to me like a race condition and I am having issues
>>> >> > finding the root cause reading through the source code.  There are
>>> >> > notes from commit e7307c5, dated Jan 5 2016,  on "Resolve race
>>> >> > between get_state_owner and dec_state_owner_ref differently"  so
>>> >> > this looks like an area where there have been issues before.
>>> >> >
>>> >> > If anyone has an idea on what the root problem is or where to look,
>>> >> > please let me know, as we cannot use Ganesha NFS if it is going to
>>> >> > assert during production.
>>> >> >
>>> >> > Thanks,
>>> >> > Eric
>>> >> >
>>> >> > On Thu, Oct 20, 2016 at 1:22 AM, Eric Eastman
>>> >> > <[email protected]> wrote:
>>> >> >> While testing Ganesha NFS V2.4.0.3 using the CEPH FSAL to a ceph
>>> >> >> file system, I am seeing the ganesha.nfsd process die due to an
>>> >> >> assert call multiple times per hour.  I have also seen it die at
>>> >> >> the same place in the code using the VFS FSAL with an ext4 file
>>> >> >> system, but it dies much less often.
>>> >> >>
>>> >> >> It is dying at line 917 in src/SAL/state_misc.c, which is called
>>> >> >> by src/SAL/state_misc.c at line 1010.  The assert call is in
>>> >> >> dec_state_owner_ref() at the line:
>>> >> >>
>>> >> >>        assert(refcount > 0);
>>> >> >>
>>> >> >> Looking at the core files and adding in some debugging code
>>> >> >> confirms that refcount is -1 when the assert call is made.
>>> >> >>
>>> >> >> It looks like the owner count is trying to go to -1 in
>>> >> >> uncache_nfs4_owner(), but as it occurs only on occasions, I think
>>> >> >> it is a race condition.
>>> >> >>
>>> >> >> Info on the build:
>>> >> >>
>>> >> >> Host OS is Ubuntu 14.04 with a 4.8.2 x86_64 kernel on a 8
>>> >> >> processor system
>>> >> >>
>>> >> >> Cmake command:
>>> >> >> # cmake -DCMAKE_INSTALL_PREFIX=/opt/keeper -DALLOCATOR=jemalloc
>>> >> >> -DUSE_ADMIN_TOOLS=ON -DUSE_DBUS=ON ../src
>>> >> >>
>>> >> >> # ganesha.nfsd -v
>>> >> >> ganesha.nfsd compiled on Oct 17 2016 at 16:50:18
>>> >> >> Release = V2.4.0.3
>>> >> >> Release comment = GANESHA file server is 64 bits compliant and
>>> >> >> supports NFS v3,4.0,4.1 (pNFS) and 9P
>>> >> >> Git HEAD = 0f55a9a97a4bf232fb0e42542e4ca7491fbf84ce
>>> >> >> Git Describe = V2.4.0.3-0-g0f55a9a
>>> >> >>
>>> >> >> # ceph -v
>>> >> >> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>>> >> >>
>>> >> >> # cat ganesha.conf
>>> >> >> LOG {
>>> >> >>     components {
>>> >> >>        ALL = INFO;
>>> >> >>     }
>>> >> >> }
>>> >> >>
>>> >> >> EXPORT_DEFAULTS {
>>> >> >> SecType = none, sys;
>>> >> >> Protocols = 3, 4;
>>> >> >> Transports = TCP;
>>> >> >> }
>>> >> >>
>>> >> >> # define CephFS export
>>> >> >> EXPORT {
>>> >> >>     Export_ID = 42;
>>> >> >>     Path = /top;
>>> >> >>     Pseudo = /top;
>>> >> >>     Access_Type = RW;
>>> >> >>     Squash = No_Root_Squash;
>>> >> >>     FSAL {
>>> >> >>         Name = CEPH;
>>> >> >>     }
>>> >> >> }
>>> >> >>
>>> >> >> The VFS export for the ext4 tests was:
>>> >> >>
>>> >> >> # define VFS export
>>> >> >> EXPORT {
>>> >> >>     Export_ID = 43;
>>> >> >>     Path = /var/top;
>>> >> >>     Pseudo = /var/top;
>>> >> >>     Access_Type = RW;
>>> >> >>     Squash = No_Root_Squash;
>>> >> >>     FSAL {
>>> >> >>         Name = VFS;
>>> >> >>     }
>>> >> >> }
>>> >> >>
>>> >> >> The test was 2 Ubuntu 14.04 NFS clients each having 6 processes,
>>> >> >> writing 11,000 256k files in separate directory trees with 11
>>> >> >> files per lowest level node. On each Ubuntu client, 3 processes
>>> >> >> wrote to an NFS 3 mount and 3 wrote to an NFS 4 mount. The files are
>>> >> >> then read and verified, deleted, and the test restarts.
>>> >> >>
>>> >> >> Regards,
>>> >> >> Eric
>>> >> >
>>> >> > _______________________________________________
>>> >> > Nfs-ganesha-devel mailing list
>>> >> > [email protected]
>>> >> > https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
>>> >
>>> >
>>> > ---
>>> > This email has been checked for viruses by Avast antivirus software.
>>> > https://www.avast.com/antivirus
>>> >
>>
>>

