** Description changed:

+ == SRU Justification ==
+ 
+ [Impact]
+ Occasionally an application gets stuck in "D" state on NFS reads/sync and
close system calls. All subsequent operations on the NFS mounts are stuck
and a reboot is required to rectify the situation.
+ 
+ [Fix]
+ Use GFP_NOIO for some allocations in writeback to avoid a deadlock. This is 
upstream in:
+ ae97aa524ef4 ("NFS: Use GFP_NOIO for two allocations in writeback")
+ 
+ [Testcase]
+ See the [Test scenario] section in the previous description below.
+ 
+ A test kernel with this patch was tested heavily (>100hrs of test suite)
+ without issue.
+ 
+ [Regression Potential]
+ This changes some memory allocations in NFS writeback to use a different
allocation policy (GFP_NOIO), which could potentially affect NFS behaviour
under memory pressure.
+ 
+ However, the patch is already in Artful and Bionic without issue.
+ 
+ The patch does not apply to Trusty.
+ 
+ == Previous Description ==
+ 
  Using Ubuntu Xenial user reports processes hang in D state waiting for
  disk io.
  
  Occasionally one of the applications gets into "D" state on NFS
  reads/sync and close system calls. Based on the kernel backtraces, it
  seems to be stuck in a kmalloc allocation during cleanup of dirty NFS
  pages.
  
  All subsequent operations on the NFS mounts are stuck and a reboot is
  required to rectify the situation.
  
  [Test scenario]
  
- 1) Applications running in Docker environment 
- 2) Application have cgroup limits --cpu-shares --memory -shm-limit 
- 3) python and C++ based applications (torch and caffe) 
- 4) Applications read big lmdb files and write results to NFS shares 
- 5) use NFS v3 , hard and fscache is enabled 
- 6) now swap space is configured 
+ 1) Applications running in a Docker environment
+ 2) Applications have cgroup limits: --cpu-shares --memory -shm-limit
+ 3) Python and C++ based applications (Torch and Caffe)
+ 4) Applications read big LMDB files and write results to NFS shares
+ 5) NFS v3 is used, with hard mounts and fscache enabled
+ 6) no swap space is configured
  
  This causes all other I/O activity on that mount to hang.
  
  We are running into this issue more frequently and have identified a
  few applications causing this problem.
  
  As updated in the description, the problem seems to happen when
  exercising the following stack:
  
  try_to_free_mem_cgroup_pages+0xba/0x1a0
  
  We see this with Docker containers using the cgroup option --memory
  <USER_SPECIFIED_MEM>.
  
  Whenever there is a deadlock, we see that the hung process has reached
  the maximum cgroup limit multiple times, and typically cleans up dirty
  data and caches to bring usage back under the limit.
  
  This reclaim path runs many times and eventually we hit what is
  probably a race and get into a deadlock.

** Changed in: linux (Ubuntu)
     Assignee: Dragan S. (dragan-s) => Daniel Axtens (daxtens)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1750038

Title:
  user space process hung in 'D' state waiting for disk io to complete

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  == SRU Justification ==

  [Impact]
  Occasionally an application gets stuck in "D" state on NFS reads/sync
  and close system calls. All subsequent operations on the NFS mounts are
  stuck and a reboot is required to rectify the situation.

  [Fix]
  Use GFP_NOIO for some allocations in writeback to avoid a deadlock. This is 
upstream in:
  ae97aa524ef4 ("NFS: Use GFP_NOIO for two allocations in writeback")
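
  For illustration only, here is a minimal sketch of the general pattern the
  fix applies (the struct and function names below are placeholders, not the
  actual upstream diff): allocations made while writing back dirty NFS pages
  use GFP_NOIO instead of GFP_KERNEL, so the allocator cannot enter direct
  reclaim that issues I/O and then waits on the very writeback already in
  progress.

    /* Hypothetical sketch -- not the actual upstream patch. */
    #include <linux/slab.h>
    #include <linux/gfp.h>
    #include <linux/mm_types.h>

    struct demo_wb_req {
            struct page *page;
            unsigned int len;
    };

    /* Called on the writeback path for a dirty page. */
    static struct demo_wb_req *demo_alloc_wb_req(struct page *page,
                                                  unsigned int len)
    {
            struct demo_wb_req *req;

            /*
             * GFP_KERNEL here could enter direct reclaim, which may try to
             * clean dirty NFS pages and deadlock against the writeback we
             * are already performing.  GFP_NOIO forbids the allocator from
             * starting any I/O to satisfy this request.
             */
            req = kzalloc(sizeof(*req), GFP_NOIO);
            if (!req)
                    return NULL;

            req->page = page;
            req->len = len;
            return req;
    }

  The same idea applies to any allocation reachable from the NFS writeback
  path; whether a given call site needs it depends on whether it can run
  under memory pressure while dirty pages are being cleaned.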

  [Testcase]
  See the [Test scenario] section in the previous description below.

  A test kernel with this patch was tested heavily (>100hrs of test
  suite) without issue.

  [Regression Potential]
  This changes some memory allocations in NFS writeback to use a different
  allocation policy (GFP_NOIO), which could potentially affect NFS behaviour
  under memory pressure.

  However, the patch is already in Artful and Bionic without issue.

  The patch does not apply to Trusty.

  == Previous Description ==

  Using Ubuntu Xenial user reports processes hang in D state waiting for
  disk io.

  Occasionally one of the applications gets into "D" state on NFS
  reads/sync and close system calls. Based on the kernel backtraces, it
  seems to be stuck in a kmalloc allocation during cleanup of dirty NFS
  pages.

  All subsequent operations on the NFS mounts are stuck and a reboot is
  required to rectify the situation.

  [Test scenario]

  1) Applications running in a Docker environment
  2) Applications have cgroup limits: --cpu-shares --memory -shm-limit
  3) Python and C++ based applications (Torch and Caffe)
  4) Applications read big LMDB files and write results to NFS shares
  5) NFS v3 is used, with hard mounts and fscache enabled
  6) no swap space is configured

  This causes all other I/O activity on that mount to hang.

  We are running into this issue more frequently and have identified a
  few applications causing this problem.

  As updated in the description, the problem seems to happen when
  exercising the following stack:

  try_to_free_mem_cgroup_pages+0xba/0x1a0

  We see this with Docker containers using the cgroup option --memory
  <USER_SPECIFIED_MEM>.

  Whenever there is a deadlock, we see that the hung process has reached
  the maximum cgroup limit multiple times, and typically cleans up dirty
  data and caches to bring usage back under the limit.

  This reclaim path runs many times and eventually we hit what is
  probably a race and get into a deadlock.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1750038/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
