[Kernel-packages] [Bug 1797314] Re: fscache: bad refcounting in fscache_op_complete leads to OOPS

2018-10-31 Thread Kiran Kumar Modukuri
Yes, the test results indicate that data was read and cached using
fscache, and 960948 ops have gone through the change.

We are not planning to test with the 4.4.0-139 kernel, as I don't have a
system set up for this kernel.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1797314

Title:
  fscache: bad refcounting in fscache_op_complete leads to OOPS

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  In Progress

Bug description:
  SRU Justification
  -

  [Impact]

  A kernel BUG is sometimes observed when using fscache:
  [4740718.880898] FS-Cache:
  [4740718.880920] FS-Cache: Assertion failed
  [4740718.880934] FS-Cache: 0 > 0 is false
  [4740718.881001] [ cut here ]
  [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
  [4740718.881040] invalid opcode:  [#1] SMP

  [4740718.892659] Call Trace:
  [4740718.893506]  [] cachefiles_read_copier+0x3a9/0x410 [cachefiles]
  [4740718.894374]  [] fscache_op_work_func+0x22/0x50 [fscache]
  [4740718.895180]  [] process_one_work+0x150/0x3f0
  [4740718.895966]  [] worker_thread+0x11a/0x470
  [4740718.896753]  [] ? __schedule+0x359/0x980
  [4740718.897783]  [] ? rescuer_thread+0x310/0x310
  [4740718.898581]  [] kthread+0xd6/0xf0
  [4740718.899469]  [] ? kthread_park+0x60/0x60
  [4740718.900477]  [] ret_from_fork+0x3f/0x70
  [4740718.901514]  [] ? kthread_park+0x60/0x60

  [Problem]

  In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in
  part:

      atomic_sub(n_pages, &op->n_pages);
      if (atomic_read(&op->n_pages) <= 0)
              fscache_op_complete(&op->op, true);

  The code uses atomic_sub followed by a separate atomic_read. Two threads
  decrementing the page count can therefore race: both may see
  op->n_pages <= 0 at the same time and both end up calling
  fscache_op_complete, leading to the OOPS.

  [Fix]
  The fix is trivial: use atomic_sub_return instead of two separate calls.
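
  As a rough userspace illustration of the race and the fix, the sketch
  below models the page counter with C11 atomics rather than the kernel's
  atomic_t. The names struct retrieval_op, retrieval_complete and
  complete_calls are hypothetical stand-ins, not kernel APIs.
  atomic_fetch_sub returns the value held before the subtraction, so
  subtracting n_pages from it yields what atomic_sub_return would return.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the retrieval operation; not the kernel struct. */
struct retrieval_op {
    atomic_int n_pages;        /* pages still outstanding */
    atomic_int complete_calls; /* how many times completion ran */
};

/*
 * Subtract and test in one atomic step. atomic_fetch_sub() returns the value
 * held before the subtraction, so (old - n_pages) is the value this caller
 * produced. Provided the total subtracted equals the initial count, exactly
 * one caller observes the counter reaching zero, so completion runs once.
 * With a separate subtract and read, two callers can both observe <= 0.
 */
static bool retrieval_complete(struct retrieval_op *op, int n_pages)
{
    int remaining;

    assert(n_pages > 0);
    remaining = atomic_fetch_sub(&op->n_pages, n_pages) - n_pages;

    if (remaining <= 0) {
        /* Models the single call to fscache_op_complete(). */
        atomic_fetch_add(&op->complete_calls, 1);
        return true;
    }
    return false;
}
```

  A caller would initialise n_pages to the number of pages requested and
  call retrieval_complete from each worker; only the call that brings the
  counter to zero returns true.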

  [Testcase]
  I believe the user has tested the patch successfully on their
  fscache/cachefiles setup.

  [Regression Potential]
  Limited to fscache. Small, comprehensible change.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797314/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active

2018-10-10 Thread Kiran Kumar Modukuri
test log attached

** Attachment added: "test log for fscache page leak."
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+attachment/5199707/+files/fscache_page_leak_test.log

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1793430

Title:
  Page leaking in cachefiles_read_backing_file while vmscan is active

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  SRU Justification
  -

  [Description]
  In a heavily loaded system where the system pagecache is nearing memory
  limits and fscache is enabled, pages can be leaked by fscache while trying
  to read pages from the cachefiles backend. This can happen because two
  applications can be reading the same page from a single mount, so two
  threads can be trying to read the backing page at the same time. As a
  result, one of the threads finds that a page for the backing file or netfs
  file is already in the radix tree. During the error handling, cachefiles
  does not clean up the reference on the backing page, leading to a page
  leak.
  
  [Fix]
  The fix is straightforward: decrement the reference when an error is
  encountered.
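
  The leak and its fix can be pictured with a small userspace model. This
  is only a sketch under stated assumptions: struct toy_page, struct
  toy_cache, live_pages and all toy_* functions are hypothetical and merely
  imitate the shape of the cachefiles error path; they are not the kernel
  code or the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical miniature page with a reference count; not struct page. */
struct toy_page {
    int refcount;
};

#define CACHE_SLOTS 8

/* Toy stand-in for the page cache radix tree: one slot per index. */
struct toy_cache {
    struct toy_page *slot[CACHE_SLOTS];
};

static int live_pages; /* pages whose refcount has not dropped to zero */

static void toy_get_page(struct toy_page *page)
{
    if (page->refcount++ == 0)
        live_pages++;
}

static void toy_put_page(struct toy_page *page)
{
    assert(page->refcount > 0);
    if (--page->refcount == 0)
        live_pages--;
}

/* Insert a page; fails with -EEXIST if another reader got there first. */
static int toy_add_to_cache(struct toy_cache *cache, struct toy_page *page,
                            size_t index)
{
    assert(index < CACHE_SLOTS);
    if (cache->slot[index] != NULL)
        return -EEXIST;
    toy_get_page(page); /* the cache holds its own reference */
    cache->slot[index] = page;
    return 0;
}

/* Reader path: take a reference, try to insert, clean up on failure. */
static int toy_read_backing_page(struct toy_cache *cache,
                                 struct toy_page *newpage, size_t index)
{
    int ret;

    toy_get_page(newpage); /* reference held by this reader */
    ret = toy_add_to_cache(cache, newpage, index);
    if (ret == -EEXIST) {
        /* The fix: drop our reference instead of returning with it held. */
        toy_put_page(newpage);
        return ret;
    }
    toy_put_page(newpage); /* done; the cache keeps its reference */
    return 0;
}
```

  Without the toy_put_page call in the -EEXIST branch, the losing reader's
  page would keep a non-zero refcount forever, which is the leak described
  above.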
  
  [Testing]
  A user has tested the fix for 12+ hours using the following method:

  1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc :/export /mnt/nfs
  2) create 1 files of 2.8MB in an NFS mount.
  3) start a thread to simulate heavy VM pressure
     (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
  4) start multiple parallel readers for the data set at the same time
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     ..
     ..
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
  5) finally, check using cat /proc/fs/fscache/stats | grep -i pages ;
     free -h , cat /proc/meminfo and page-types -r -b lru
     to ensure all pages are freed.

  [Regression Potential]
  Limited to cachefiles.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+subscriptions



[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active

2018-10-10 Thread Kiran Kumar Modukuri
Yes, we were able to test with the 4.15.0-37-generic kernel as well.

Your sosreport has been generated and saved in:
  /tmp/sosreport-kmodukuri.00195310-20181010154537.tar.xz

The checksum is: 8c057643e9995694678915ce550e422c

Please send this file to your support representative.

** Attachment added: "sos report after testing on 4.15.0-37-generic"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+attachment/5199706/+files/sosreport-kmodukuri.00195310-20181010154537.tar.xz



[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active

2018-10-10 Thread Kiran Kumar Modukuri
All tests passed from Nvidia testing for fscache.

Your sosreport has been generated and saved in:
  /tmp/sosreport-tid1870983.00195310-20181009172638.tar.xz

The checksum is: 2f7b671685cf8116920efb63e6397fe2



** Attachment added: "sosreport-tid1870983.00195310-20181009172638.tar.xz"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+attachment/5199600/+files/sosreport-tid1870983.00195310-20181009172638.tar.xz



[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active

2018-10-05 Thread Kiran Kumar Modukuri
We are trying to get hardware resources to test this before 10/9. Will
keep you posted.



[Kernel-packages] [Bug 1776277] [NEW] fscache cookie refcount updated incorrectly during fscache object allocation

2018-06-11 Thread Kiran Kumar Modukuri
Public bug reported:

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache + Cachefiles use:

 kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
 kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!

[Cause]
 1) Two threads are trying to operate on a cookie and two objects.
 2a) One thread tries to unmount the filesystem and, in the process, goes
     over a huge list of objects, marking them dead and deleting the objects.
     cookie->usage is also decremented in the following path:
       nfs_fscache_release_super_cookie
        -> __fscache_relinquish_cookie
         -> __fscache_cookie_put
          -> BUG_ON(atomic_read(&cookie->usage) <= 0);

 2b) A second thread tries to look up an object for reading data in the
     following path:

     fscache_alloc_object
      1) cachefiles_alloc_object
          -> fscache_object_init
           -> assign cookie, but usage not bumped.
      2) fscache_attach_object -> fails in cant_attach_object because the
         cookie's backing object or the cookie->parent object is going away
      3) fscache_put_object
          -> cachefiles_put_object
           -> fscache_object_destroy
            -> fscache_cookie_put
             -> BUG_ON(atomic_read(&cookie->usage) <= 0);

[Fix]
 Bump up the cookie usage in fscache_object_init, when the object is first
 assigned a cookie. Do this atomically, so that the cookie is assigned and
 its refcount bumped only if the refcount is not zero.
 Remove the assignment in fscache_attach_object.
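
 One way to picture "bump the refcount only if it is not zero" is the
 userspace sketch below, written with C11 atomics. It is an illustration
 under assumptions, not the actual patch: struct toy_cookie, struct
 toy_object, toy_cookie_get_not_zero and toy_object_init are hypothetical
 names, and the compare-and-swap loop stands in for a kernel helper such as
 atomic_inc_not_zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's fscache structures. */
struct toy_cookie {
    atomic_int usage;
};

struct toy_object {
    struct toy_cookie *cookie;
};

/* Take a reference only while the count is still non-zero. */
static bool toy_cookie_get_not_zero(struct toy_cookie *cookie)
{
    int old = atomic_load(&cookie->usage);

    while (old > 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&cookie->usage, &old, old + 1))
            return true;
    }
    return false; /* the cookie is already on its way out */
}

/* Assign the cookie to the object and pin it in the same step. */
static bool toy_object_init(struct toy_object *object,
                            struct toy_cookie *cookie)
{
    assert(object != NULL && cookie != NULL);

    if (!toy_cookie_get_not_zero(cookie)) {
        object->cookie = NULL;
        return false;
    }
    object->cookie = cookie;
    return true;
}
```

 Because the object now owns a reference from the moment the cookie is
 assigned, the put in the destroy path is balanced even when the attach
 step fails.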

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New

** Description changed:

  == SRU Justification ==
  
  [Impact]
  Oops during heavy NFS + FSCache + Cachefiles use:
  
-  kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
-  kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
+  kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
+  kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
  
  [Cause]
-  1)Two threads are trying to do operate on a cookie and two objects.
-  2a)One thread tries to unmount the filesystem and in process goes over
-a huge list of objects marking them dead and deleting the objects. 
-cookie->usage is also decremented in
-   nfs_fscache_release_super_cookie
--> __fscache_relinquish_cookie
-   ->__fscache_cookie_put
-  ->BUG_ON(atomic_read(&cookie->usage) 
<= 0);
+  1)Two threads are trying to do operate on a cookie and two objects.
+  2a)One thread tries to unmount the filesystem and in process goes over
+    a huge list of objects marking them dead and deleting the objects.
+    cookie->usage is also decremented in following path
+   nfs_fscache_release_super_cookie
+    -> __fscache_relinquish_cookie
+ ->__fscache_cookie_put
+ ->BUG_ON(atomic_read(&cookie->usage) <= 0);
  
-  2b)second thread tries to lookup an object for reading data in 
fscache_alloc_object
-  1) cachefiles_alloc_object-> fscache_object_init -> assign cookie, but 
usage not bumped.
-  2) fscache_attach_object -> fails in cant_attach_object because the 
cookie's backing object
-  or cookie's->parent object are going away
-  3)fscache_put_object
--> cachefiles_put_object
-->fscache_object_destroy
-->fscache_cookie_put
-->BUG_ON(atomic_read(&cookie->usage) <= 0);
+  2b)second thread tries to lookup an object for reading data in
+ following path
+  
+  fscache_alloc_object
+   1) cachefiles_alloc_object
+   -> fscache_object_init 
+ -> assign cookie, but usage not bumped.
+  2) fscache_attach_object -> fails in cant_attach_object because the 
+ cookie's backing object or cookie's->parent object are going away
+  3)fscache_put_object
+    -> cachefiles_put_object
+     ->fscache_object_destroy
+       ->fscache_cookie_put
+    ->BUG_ON(atomic_read(&cookie->usage) <= 0);
  [Fix]
-  Bump up the cookie usage in fscache_object_init, 
-  when it is first being assigned a cookie atomically such that the cookie 
-  is added and bumped up if its refcount is not zero.
-  remove the assignment in the attach_object.
+  Bump up the cookie usage in fscache_object_init,
+  when it is first being assigned a cookie atomically such that the cookie
+  is added and bumped up if its refcount is not zero.
+  remove the assignment in the attach_object.
  
  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.
  
  [Regression Potential]
-  - Limited to fscache/cachefiles.
+  - Limited to fscache/cachefiles.

-- 
You received this bug notification because 

[Kernel-packages] [Bug 1776254] [NEW] CacheFiles: Error: Overlong wait for old active object to go away.

2018-06-11 Thread Kiran Kumar Modukuri
Public bug reported:

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache + Cachefiles use:

 CacheFiles: Error: Overlong wait for old active object to go away.
 BUG: unable to handle kernel NULL pointer dereference at 0002

 CacheFiles: Error: Object already active
 kernel BUG at fs/cachefiles/namei.c:163!

[Cause]
  In a heavily loaded system with big files being read and truncated,
  an fscache object for a cookie is being dropped and a new object is being
  looked up. The new object being looked up has to wait for the old object
  to go away before the new object is moved to the active state.

[Fix]
 Clear the flag 'CACHEFILES_OBJECT_ACTIVE' for the new object when retrying
 the object lookup.
 Remove the BUG() for the case where the old object is still being dropped
 and convert it to WARN().
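
 The first part of the fix can be sketched in userspace as follows. This
 is a hedged model, not the cachefiles code: TOY_OBJECT_ACTIVE, struct
 toy_cf_object and the toy_* functions are hypothetical, and C11
 atomic_fetch_or / atomic_fetch_and stand in for the kernel's bit
 operations. The point is that a retry which leaves the bit set from the
 previous attempt would find the object "already active".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_OBJECT_ACTIVE (1u << 0) /* hypothetical flag bit */

/* Hypothetical stand-in; not struct cachefiles_object. */
struct toy_cf_object {
    atomic_uint flags;
};

/* Returns true if this call set the bit, false if it was already set. */
static bool toy_mark_active(struct toy_cf_object *obj)
{
    unsigned int old = atomic_fetch_or(&obj->flags, TOY_OBJECT_ACTIVE);

    return (old & TOY_OBJECT_ACTIVE) == 0;
}

static void toy_clear_active(struct toy_cf_object *obj)
{
    atomic_fetch_and(&obj->flags, ~TOY_OBJECT_ACTIVE);
}

/*
 * Retry path: clear the flag left over from the previous attempt before
 * marking the object active again. Without the clear, the second
 * toy_mark_active() would report "already active".
 */
static bool toy_retry_lookup(struct toy_cf_object *obj)
{
    assert(obj != NULL);
    toy_clear_active(obj);
    return toy_mark_active(obj);
}
```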

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1776254

Title:
  CacheFiles: Error: Overlong wait for old active object to go away.

Status in linux package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776254/+subscriptions
