Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock
Hi,

We believe we have seen this deadlock on our XFS systems and are currently
trying the patch (aware of the disclaimer that it may not be production
ready). However, reproducing the issue is very hard. I was interested to see
Dave Chinner say that he had seen a repeatable deadlock, and hoped that meant
there was a process or method that could provoke the situation on demand.
Currently we see this deadlock about once every 1-2 months, so we are trying
to speed up the iteration.

Regards,
Gavin Will
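In case it helps anyone looking for a starting point: the patch below
describes the workload as a multi-process application doing concurrent writes
and fsyncs to a small range of a single file, racing with background
writeback. The following is a hypothetical reproducer sketch along those
lines only - it has not been shown to trigger this deadlock (the race is
timing dependent), the file path is a placeholder, and lowering
vm.dirty_background_bytes so background writeback scans the file frequently
is an untested suggestion.

/*
 * Hypothetical reproducer sketch based on the workload described in the
 * patch: several processes doing concurrent small writes and fsyncs to a
 * narrow range of one file on XFS, racing against background writeback.
 * NOT a confirmed trigger - the deadlock is timing dependent.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC   4       /* concurrent writer processes */
#define NPAGES  4       /* keep all writes within the first few pages */
#define PAGESZ  4096

static void writer(const char *path)
{
        char buf[64];
        int fd = open(path, O_RDWR);

        if (fd < 0) {
                perror("open");
                exit(1);
        }
        memset(buf, 'x', sizeof(buf));

        for (;;) {
                /* dirty one page in the small range, then force writeback */
                off_t off = (off_t)(random() % NPAGES) * PAGESZ;

                if (pwrite(fd, buf, sizeof(buf), off) < 0)
                        perror("pwrite");
                if (fsync(fd) < 0)
                        perror("fsync");
        }
}

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/xfs/deadlock-test";
        int fd, i;

        /* pre-size the file so every write dirties an existing page */
        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)NPAGES * PAGESZ) < 0) {
                perror("setup");
                return 1;
        }
        close(fd);

        for (i = 0; i < NPROC; i++) {
                if (fork() == 0) {
                        srandom(getpid());
                        writer(path);
                }
        }
        while (wait(NULL) > 0)
                ;
        return 0;
}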
Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock
On Fri, Oct 05, 2018 at 12:46:40PM -0700, Andrew Morton wrote:
> On Fri, 5 Oct 2018 15:45:26 +1000 Dave Chinner wrote:
>
> > From: Dave Chinner
> >
> > We've recently seen a workload on XFS filesystems with a repeatable
> > deadlock between background writeback and a multi-process
> > application doing concurrent writes and fsyncs to a small range of a
> > file.
> >
> > ...
> >
> > Signed-off-by: Dave Chinner
> > Reviewed-by: Jan Kara
>
> Not a serious enough problem for a -stable backport?

Don't have enough evidence to say one way or another. The reported incident
was from a RHEL 7 kernel, so the bug has been there for years in one form or
another, but it's only ever been triggered by this one-off custom workload.

I haven't done any analysis on older kernels, nor have I looked to see if
there's any gotchas that a stable backport might encounter. And I tend not to
change stuff in a path that is critical to data integrity without at least
doing enough due diligence to suggest a stable backport would be fine.

You can mark it for stable backports if you want, but I'm not prepared to
because I haven't done the work necessary to ensure it's safe to do so.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock
On Fri, 5 Oct 2018 15:45:26 +1000 Dave Chinner wrote:

> From: Dave Chinner
>
> We've recently seen a workload on XFS filesystems with a repeatable
> deadlock between background writeback and a multi-process
> application doing concurrent writes and fsyncs to a small range of a
> file.
>
> ...
>
> Signed-off-by: Dave Chinner
> Reviewed-by: Jan Kara

Not a serious enough problem for a -stable backport?
[PATCH] writeback: fix range_cyclic writeback vs writepages deadlock
From: Dave Chinner

We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process application
doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback           Process 1                  Process 2

xfs_vm_writepages
 write_cache_pages
  writeback_index = 2
  cycled = 0
  find page 2 dirty
  lock Page 2
  ->writepage
    page 2 writeback
    page 2 clean
    page 2 added to bio
  no more pages
                    write()
                     locks page 1
                     dirties page 1
                     locks page 2
                     dirties page 1
                     fsync()
                     xfs_vm_writepages
                      write_cache_pages
                       start index 0
                       find page 1 towrite
                       lock Page 1
                       ->writepage
                         page 1 writeback
                         page 1 clean
                         page 1 added to bio
                       find page 2 towrite
                       lock Page 2
                       page 2 is writeback
                        <blocks>
                                               write()
                                                locks page 1
                                                dirties page 1
                                                fsync()
                                                xfs_vm_writepages
                                                 write_cache_pages
                                                  start index 0

  !done && !cycled
  sets index to 0,
  restarts lookup
  find page 1 dirty
                                                  find page 1 towrite
                                                  lock Page 1
                                                  page 1 is writeback
                                                   <blocks>

  lock Page 1
   <blocks>

DEADLOCK because:

- process 1 needs page 2 writeback to complete to make enough progress
  to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2 can
  progress and unlock the page it is blocked on, then it can issue the
  IO pending for page 2
- process 2 can't make progress until process 1 issues IO for page 1

The underlying cause of the problem here is that range_cyclic writeback is
processing pages in descending index order as we hold higher index pages in a
structure controlled from above write_cache_pages(). The write_cache_pages()
caller needs to be able to submit these pages for IO before
write_cache_pages restarts writeback at mapping index 0 to avoid wcp
inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private
context held across write_cache_pages() - filesystems using this
infrastructure always submit pages in ->writepage immediately and so there is
no problem with range_cyclic going back to mapping index 0.

However:

        mpage_writepages() has a private bio context,
        exofs_writepages() has page_collect
        fuse_writepages() has fuse_fill_wb_data
        nfs_writepages() has nfs_pageio_descriptor
        xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback in
their private structures until write_cache_pages() returns, and hence they
are all susceptible to this deadlock.

Also worth noting is that ext4 has its own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock. I looked at
the code long enough to understand that it has a similar retry loop for
range_cyclic writeback reaching the end of the file and then promptly ran
away before my eyes bled too much. I'll leave it for the ext4 developers to
determine whether their code actually has this deadlock and how to fix it if
it does.
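For readers who don't have mm/page-writeback.c in front of them, the sketch
below condenses the range_cyclic logic in write_cache_pages() that the trace
above walks through. It is heavily abridged and is not the real function:
find_next_dirty_page() is a made-up stand-in for the real tagged pagevec
lookup, and error handling, nr_to_write accounting and other bookkeeping are
omitted.

/*
 * Heavily abridged sketch of write_cache_pages() (mm/page-writeback.c).
 * find_next_dirty_page() is a made-up stand-in for the real tagged
 * pagevec lookup; error handling and bookkeeping are omitted.
 */
int write_cache_pages(struct address_space *mapping,
                      struct writeback_control *wbc,
                      writepage_t writepage, void *data)
{
        pgoff_t index, end, writeback_index = 0;
        int cycled, done = 0;

        if (wbc->range_cyclic) {
                writeback_index = mapping->writeback_index; /* prev offset */
                index = writeback_index;
                cycled = (index == 0);  /* only a full pass if we start at 0 */
                end = -1;               /* scan all the way to EOF */
        } else {
                index = wbc->range_start >> PAGE_SHIFT;
                end = wbc->range_end >> PAGE_SHIFT;
                cycled = 1;             /* range_cyclic restart not needed */
        }
retry:
        while (!done && index <= end) {
                struct page *page = find_next_dirty_page(mapping, &index, end);

                if (!page)
                        break;          /* no more dirty pages in range */

                lock_page(page);        /* blocks if another task holds the lock */

                if (PageWriteback(page)) {
                        if (wbc->sync_mode != WB_SYNC_NONE)
                                wait_on_page_writeback(page); /* fsync waits */
                        else {
                                unlock_page(page);   /* background skips it */
                                continue;
                        }
                }

                /*
                 * Hand the page to the caller's ->writepage. The caller may
                 * hold it under writeback in a private context (bio,
                 * xfs_writepage_ctx, ...) and not submit the IO until after
                 * write_cache_pages() returns.
                 */
                (*writepage)(page, wbc, data);
        }

        if (!cycled && !done) {
                /*
                 * range_cyclic hit EOF without covering the whole file: wrap
                 * back to index 0 and rescan up to the original start. This
                 * restart happens while the caller may still be holding
                 * unsubmitted pages under writeback - the lock/writeback wait
                 * inversion described above.
                 */
                cycled = 1;
                index = 0;
                end = writeback_index - 1;
                goto retry;
        }
        return 0;       /* the real function returns an error status, etc. */
}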
There are a few ways I can see to avoid this deadlock. There are probably
more, but these are the first I've thought of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from writeback
   index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to indicate
    range_cyclic has hit EOF. writepages implementations can then flush the
    current context and call wcp again to continue. i.e. lift the retry into
    the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it skips
   pages it can't lock without blocking. It already does this for pages under
   writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking, just
    as we already do for pages under writeback (a rough sketch of 3/3a
    follows below)

I don't think #1 is an option -
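Purely as an illustration - this is not the patch itself, and not necessarily
what gets merged - option 3/3a could look roughly like the following fragment
inside the write_cache_pages() page loop (the surrounding loop is the one
sketched earlier): trylock-and-skip for non-data-integrity writeback,
mirroring what WB_SYNC_NONE already does for pages under writeback.

                /*
                 * Illustrative sketch of option 3/3a only - not the actual
                 * patch. Take the page lock unconditionally only for
                 * data-integrity (WB_SYNC_ALL) writeback; otherwise skip any
                 * page we cannot lock without blocking.
                 */
                if (wbc->sync_mode == WB_SYNC_ALL) {
                        lock_page(page);
                } else if (!trylock_page(page)) {
                        /*
                         * Someone else (e.g. a ->writepages caller holding
                         * this page in its private context) has the lock;
                         * don't block background writeback on it.
                         */
                        continue;
                }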