[Kernel-packages] [Bug 1797314] Re: fscache: bad refcounting in fscache_op_complete leads to OOPS
Yes, the test results indicate that data was read and cached through fscache, and 960948 ops have gone through the change. We are not planning to test with the 4.4.0-139 kernel, as I don't have a system set up for that kernel.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1797314

Title:
  fscache: bad refcounting in fscache_op_complete leads to OOPS

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  In Progress

Bug description:
  SRU Justification

  [Impact]
  A kernel BUG is sometimes observed when using fscache:

  [4740718.880898] FS-Cache:
  [4740718.880920] FS-Cache: Assertion failed
  [4740718.880934] FS-Cache: 0 > 0 is false
  [4740718.881001] [ cut here ]
  [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
  [4740718.881040] invalid opcode: [#1] SMP
  [4740718.892659] Call Trace:
  [4740718.893506] [] cachefiles_read_copier+0x3a9/0x410 [cachefiles]
  [4740718.894374] [] fscache_op_work_func+0x22/0x50 [fscache]
  [4740718.895180] [] process_one_work+0x150/0x3f0
  [4740718.895966] [] worker_thread+0x11a/0x470
  [4740718.896753] [] ? __schedule+0x359/0x980
  [4740718.897783] [] ? rescuer_thread+0x310/0x310
  [4740718.898581] [] kthread+0xd6/0xf0
  [4740718.899469] [] ? kthread_park+0x60/0x60
  [4740718.900477] [] ret_from_fork+0x3f/0x70
  [4740718.901514] [] ? kthread_park+0x60/0x60

  [Problem]
  In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in part:

      atomic_sub(n_pages, &op->n_pages);
      if (atomic_read(&op->n_pages) <= 0)
              fscache_op_complete(&op->op, true);

  The code uses atomic_sub followed by a separate atomic_read. Two threads
  decrementing the page count can race with each other: both can finish
  their subtraction before either one reads the counter, so both see
  op->n_pages <= 0 at the same time and both call fscache_op_complete,
  leading to the OOPS.

  [Fix]
  The fix is trivial: use atomic_sub_return instead of the two separate calls.

  [Testcase]
  I believe the user has tested the patch successfully on their
  fscache/cachefiles setup.

  [Regression Potential]
  Limited to fscache. Small, comprehensible change.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797314/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
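The race described for bug 1797314 can be replayed deterministically in userspace. The following is only a minimal sketch, not the kernel patch: it uses C11 <stdatomic.h> in place of the kernel atomic_t API, and struct demo_op, the buggy_*/fixed_* helpers and the replay_* functions are hypothetical names made up for illustration. The buggy pattern is split into its two halves so that the bad interleaving (both threads subtract, then both threads read) can be executed in order on a single thread.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the retrieval op; only the counter matters. */
struct demo_op {
    atomic_int n_pages;
    int completions; /* how often "fscache_op_complete" would have run */
};

/* Buggy pattern, first half: the subtraction on its own. */
void buggy_sub(struct demo_op *op, int n)
{
    atomic_fetch_sub(&op->n_pages, n);
}

/* Buggy pattern, second half: a separate read of the shared counter. */
void buggy_check(struct demo_op *op)
{
    if (atomic_load(&op->n_pages) <= 0)
        op->completions++;
}

/*
 * Fixed pattern: act on the value produced by this caller's own
 * subtraction. atomic_fetch_sub() returns the old value, so old - n is
 * what the kernel's atomic_sub_return() would hand back.
 */
void fixed_complete(struct demo_op *op, int n)
{
    assert(n > 0);
    int remaining = atomic_fetch_sub(&op->n_pages, n) - n;
    if (remaining <= 0)
        op->completions++;
}

/* Replays: A subtracts, B subtracts, A checks, B checks. */
int replay_buggy(void)
{
    struct demo_op op = { .completions = 0 };
    atomic_init(&op.n_pages, 2);
    buggy_sub(&op, 1);
    buggy_sub(&op, 1);
    buggy_check(&op);
    buggy_check(&op);
    return op.completions;
}

/* Same two callers, each using the combined subtract-and-test. */
int replay_fixed(void)
{
    struct demo_op op = { .completions = 0 };
    atomic_init(&op.n_pages, 2);
    fixed_complete(&op, 1);
    fixed_complete(&op, 1);
    return op.completions;
}
```

With two outstanding pages and two callers each completing one page, replay_buggy() counts two completions while replay_fixed() counts one, because only the caller whose own subtraction reaches zero acts on it.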
[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active
Test log attached.

** Attachment added: "test log for fscache page leak."
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+attachment/5199707/+files/fscache_page_leak_test.log

--
https://bugs.launchpad.net/bugs/1793430

Title:
  Page leaking in cachefiles_read_backing_file while vmscan is active

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  SRU Justification

  [Description]
  In a heavily loaded system where the system pagecache is nearing memory
  limits and fscache is enabled, pages can be leaked by fscache while
  trying to read pages from the cachefiles backend. This can happen when
  two applications read the same page from a single mount, so that two
  threads try to read the backing page at the same time. One of the
  threads then finds that a page for the backing file or netfs file is
  already in the radix tree. During the error handling, cachefiles does
  not clean up the reference on the backing page, leading to a page leak.

  [Fix]
  The fix is straightforward: decrement the reference when the error is
  encountered.

  [Testing]
  A user has tested the fix using the following method for 12+ hrs.

  1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc :/export /mnt/nfs
  2) create 1 files of 2.8MB in a NFS mount.
  3) start a thread to simulate heavy VM pressure
     (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
  4) start multiple parallel readers for the data set at the same time
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     ..
     ..
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
  5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
     free -h , cat /proc/meminfo and page-types -r -b lru
     to ensure all pages are freed.

  [Regression Potential]
  Limited to cachefiles.
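The leak described for bug 1793430 follows a common pattern: a reference is taken, a later step fails because another reader got there first, and the error path returns without dropping the reference. The sketch below is a toy userspace model under stated assumptions, not the cachefiles code: struct demo_page, the demo_* helpers and the drop_on_error switch are all hypothetical, and a plain integer stands in for the page refcount.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page with a plain (non-atomic) refcount for illustration. */
struct demo_page {
    int refcount;
};

void demo_get_page(struct demo_page *p)
{
    p->refcount++;
}

void demo_put_page(struct demo_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Pretend page-cache insert: fails if another reader already added it. */
int demo_add_to_cache(bool already_present)
{
    return already_present ? -EEXIST : 0;
}

/*
 * Take a reference on the backing page, then try to insert it.
 * drop_on_error selects between the leaky error path (false) and the
 * corrected one (true).
 */
int demo_read_backing(struct demo_page *backpage, bool raced,
                      bool drop_on_error)
{
    int ret;

    demo_get_page(backpage);
    ret = demo_add_to_cache(raced);
    if (ret < 0) {
        if (drop_on_error)
            demo_put_page(backpage); /* the fix: undo our reference */
        return ret;
    }

    /* Success path: the reference is released once the read is done. */
    demo_put_page(backpage);
    return 0;
}

/* Returns how many references are left over after one raced read. */
int demo_leaked_refs(bool drop_on_error)
{
    struct demo_page page = { .refcount = 1 }; /* the owner's reference */

    (void)demo_read_backing(&page, true, drop_on_error);
    return page.refcount - 1;
}
```

In this model demo_leaked_refs(false) leaves one stray reference per raced read, which is the kind of page that can never be freed, while demo_leaked_refs(true) leaves none.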
[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active
Yes, we were able to test with the 4.15.0-37-generic kernel as well.

  Your sosreport has been generated and saved in:
  /tmp/sosreport-kmodukuri.00195310-20181010154537.tar.xz

  The checksum is: 8c057643e9995694678915ce550e422c

  Please send this file to your support representative.

** Attachment added: "sos report after testing on 4.15.0-37-generic"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+attachment/5199706/+files/sosreport-kmodukuri.00195310-20181010154537.tar.xz
[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active
All tests passed from Nvidia testing for fscache.

  Your sosreport has been generated and saved in:
  /tmp/sosreport-tid1870983.00195310-20181009172638.tar.xz

  The checksum is: 2f7b671685cf8116920efb63e6397fe2

** Attachment added: "sosreport-tid1870983.00195310-20181009172638.tar.xz"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+attachment/5199600/+files/sosreport-tid1870983.00195310-20181009172638.tar.xz
[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active
We are trying to get hardware resources to test this before 10/9 and will keep you posted.
[Kernel-packages] [Bug 1776277] [NEW] fscache cookie refcount updated incorrectly during fscache object allocation
Public bug reported:

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache + Cachefiles use:

  kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
  kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!

[Cause]
1) Two threads are trying to operate on a cookie and two objects.

2a) One thread tries to unmount the filesystem and in the process goes
    over a huge list of objects, marking them dead and deleting the
    objects. cookie->usage is also decremented in the following path:

      nfs_fscache_release_super_cookie
       -> __fscache_relinquish_cookie
        -> __fscache_cookie_put
         -> BUG_ON(atomic_read(&cookie->usage) <= 0);

2b) A second thread tries to look up an object for reading data in the
    following path:

      fscache_alloc_object
      1) cachefiles_alloc_object
          -> fscache_object_init
           -> assign cookie, but usage not bumped.
      2) fscache_attach_object -> fails in cant_attach_object because the
         cookie's backing object or the cookie->parent object is going away
      3) fscache_put_object
          -> cachefiles_put_object
           -> fscache_object_destroy
            -> fscache_cookie_put
             -> BUG_ON(atomic_read(&cookie->usage) <= 0);

[Fix]
Bump up the cookie usage in fscache_object_init, when the object is first
assigned a cookie, and do so atomically, such that the cookie is assigned
and its usage bumped only if its refcount is not zero. Remove the
assignment in fscache_attach_object.

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New
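The [Fix] for bug 1776277 describes taking the cookie reference at assignment time, and only if the refcount is not already zero. A userspace sketch of that idea follows. It assumes a compare-and-swap loop in the style of the kernel's atomic_inc_not_zero(); whether the actual patch uses that helper is not stated in the report. struct demo_cookie, struct demo_object and all demo_* functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Increment *usage unless it is zero; returns true if a ref was taken. */
bool demo_inc_not_zero(atomic_int *usage)
{
    int old = atomic_load(usage);

    while (old != 0) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(usage, &old, old + 1))
            return true;
    }
    return false;
}

struct demo_cookie {
    atomic_int usage;
};

struct demo_object {
    struct demo_cookie *cookie;
};

/* Assign the cookie and take the reference in one step, or do neither. */
bool demo_object_init(struct demo_object *obj, struct demo_cookie *cookie)
{
    obj->cookie = NULL;
    if (!demo_inc_not_zero(&cookie->usage))
        return false; /* cookie is already on its way out */
    obj->cookie = cookie;
    return true;
}

/* Drop the reference taken in demo_object_init(), if there was one. */
void demo_object_destroy(struct demo_object *obj)
{
    if (obj->cookie != NULL) {
        int prev = atomic_fetch_sub(&obj->cookie->usage, 1);

        assert(prev > 0); /* mirrors the BUG_ON in the report */
        obj->cookie = NULL;
    }
}

/* Models init followed by a failed attach and the resulting put. */
int demo_usage_after_failed_attach(int initial_usage)
{
    struct demo_cookie cookie;
    struct demo_object obj;

    atomic_init(&cookie.usage, initial_usage);
    if (demo_object_init(&obj, &cookie))
        demo_object_destroy(&obj);
    return atomic_load(&cookie.usage);
}
```

Because the object only drops a reference it actually took, the cookie's usage count is unchanged after a failed attach (1 stays 1), and a cookie whose count has already reached zero is never touched.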
[Kernel-packages] [Bug 1776254] [NEW] CacheFiles: Error: Overlong wait for old active object to go away.
Public bug reported:

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache + Cachefiles use:

  CacheFiles: Error: Overlong wait for old active object to go away.
  BUG: unable to handle kernel NULL pointer dereference at 0002

  CacheFiles: Error: Object already active
  kernel BUG at fs/cachefiles/namei.c:163!

[Cause]
In a heavily loaded system with big files being read and truncated, an
fscache object for a cookie is being dropped while a new object is being
looked up. The new object has to wait for the old object to go away
before it is moved to the active state.

[Fix]
Clear the flag 'CACHEFILES_OBJECT_ACTIVE' for the new object when
retrying the object lookup. Remove the BUG() for the case where the old
object is still being dropped and convert it to a WARN().

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

--
https://bugs.launchpad.net/bugs/1776254
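One plausible reading of the [Fix] for bug 1776254 is that a timed-out lookup left the new object's active flag set, so the retry tripped over its own flag. The sketch below models only that reading and is not the cachefiles code: DEMO_OBJECT_ACTIVE stands in for CACHEFILES_OBJECT_ACTIVE, the error codes are assumptions, and the demo_* names and the clear_on_retry switch are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_OBJECT_ACTIVE 0x1u /* stands in for CACHEFILES_OBJECT_ACTIVE */

struct demo_object {
    unsigned int flags;
    int warnings; /* times a WARN()-style message would be logged */
};

/*
 * One attempt to mark the new object active.
 *   old_still_active: the wait for the old object timed out.
 *   clear_on_retry:   apply the fix (clear the flag before retrying).
 * Returns 0 on success, -ETIMEDOUT to request a retry, and -EBUSY at the
 * point where the unfixed code would report "Object already active".
 */
int demo_mark_active(struct demo_object *obj, bool old_still_active,
                     bool clear_on_retry)
{
    assert(obj != NULL);

    if (obj->flags & DEMO_OBJECT_ACTIVE)
        return -EBUSY;

    obj->flags |= DEMO_OBJECT_ACTIVE;
    if (old_still_active) {
        obj->warnings++; /* warn and carry on instead of BUG() */
        if (clear_on_retry)
            obj->flags &= ~DEMO_OBJECT_ACTIVE;
        return -ETIMEDOUT;
    }
    return 0;
}

/* First attempt times out; the second finds the old object gone. */
int demo_lookup_with_retry(bool clear_on_retry)
{
    struct demo_object obj = { .flags = 0, .warnings = 0 };
    int ret = demo_mark_active(&obj, true, clear_on_retry);

    if (ret == -ETIMEDOUT)
        ret = demo_mark_active(&obj, false, clear_on_retry);
    return ret;
}
```

In this model the retry succeeds only when the flag is cleared before retrying; without the clear, the second attempt fails on the stale flag left behind by the first.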