Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2019-03-26 Thread Gavin Will
Hi

We believe we have seen this deadlock issue on our XFS system and
currently are trying the patch (aware of the disclaimer that it may
not be production ready). However trying to repo this issue is very
hard. I was interested in Dave Chinner saying that he had seen a
repeatable deadlock and hoped that meant that there was either a
process or method that could provoke this situation from happening.
Currently we see this deadlock happen about once every 1-2 months and
hence trying to speed the iteration up.

Regards
Gavin Will


Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-05 Thread Dave Chinner
On Fri, Oct 05, 2018 at 12:46:40PM -0700, Andrew Morton wrote:
> On Fri,  5 Oct 2018 15:45:26 +1000 Dave Chinner  wrote:
> 
> > From: Dave Chinner 
> > 
> > We've recently seen a workload on XFS filesystems with a repeatable
> > deadlock between background writeback and a multi-process
> > application doing concurrent writes and fsyncs to a small range of a
> > file.
> > 
> > ...
> > 
> > Signed-off-by: Dave Chinner 
> > Reviewed-by: Jan Kara 
> 
> Not a serious enough problem for a -stable backport?

Don't have enough evidence to say one way or another. The reported
incident was from a RHEL 7 kernel, so the bug has been there for
years in one form or another, but it's only ever been triggered by
this one-off custom workload.

I haven't done any analysis on older kernels, nor have I looked to see
if there's any gotchas that a stable backport might encounter. And I
tend not to change stuff in a path that is critical to data integrity
without at least doing enough due diligence to suggest a stable
backport would be fine.

You can mark it for stable backports if you want, but I'm not
prepared to because I haven't done the work necessary to ensure it's
safe to do so.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-05 Thread Dave Chinner
On Fri, Oct 05, 2018 at 12:46:40PM -0700, Andrew Morton wrote:
> On Fri,  5 Oct 2018 15:45:26 +1000 Dave Chinner  wrote:
> 
> > From: Dave Chinner 
> > 
> > We've recently seen a workload on XFS filesystems with a repeatable
> > deadlock between background writeback and a multi-process
> > application doing concurrent writes and fsyncs to a small range of a
> > file.
> > 
> > ...
> > 
> > Signed-off-by: Dave Chinner 
> > Reviewed-by: Jan Kara 
> 
> Not a serious enough problem for a -stable backport?

Don't have enough evidence to say one way or another. The reported
incident was from a RHEL 7 kernel, so the bug has been there for
years in one form or another, but it's only ever been triggered by
this one-off custom workload.

I haven't done any analysis on older kernels, nor have I looked to see
if there's any gotchas that a stable backport might encounter. And I
tend not to change stuff in a path that is critical to data integrity
without at least doing enough due diligence to suggest a stable
backport would be fine.

You can mark it for stable backports if you want, but I'm not
prepared to because I haven't done the work necessary to ensure it's
safe to do so.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-05 Thread Andrew Morton
On Fri,  5 Oct 2018 15:45:26 +1000 Dave Chinner  wrote:

> From: Dave Chinner 
> 
> We've recently seen a workload on XFS filesystems with a repeatable
> deadlock between background writeback and a multi-process
> application doing concurrent writes and fsyncs to a small range of a
> file.
> 
> ...
> 
> Signed-off-by: Dave Chinner 
> Reviewed-by: Jan Kara 

Not a serious enough problem for a -stable backport?


Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-05 Thread Andrew Morton
On Fri,  5 Oct 2018 15:45:26 +1000 Dave Chinner  wrote:

> From: Dave Chinner 
> 
> We've recently seen a workload on XFS filesystems with a repeatable
> deadlock between background writeback and a multi-process
> application doing concurrent writes and fsyncs to a small range of a
> file.
> 
> ...
> 
> Signed-off-by: Dave Chinner 
> Reviewed-by: Jan Kara 

Not a serious enough problem for a -stable backport?


[PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-04 Thread Dave Chinner
From: Dave Chinner 

We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process
application doing concurrent writes and fsyncs to a small range of a
file.

range_cyclic
writeback   Process 1   Process 2

xfs_vm_writepages
  write_cache_pages
writeback_index = 2
cycled = 0

find page 2 dirty
lock Page 2
->writepage
  page 2 writeback
  page 2 clean
  page 2 added to bio
no more pages
write()
locks page 1
dirties page 1
locks page 2
dirties page 1
fsync()

xfs_vm_writepages
write_cache_pages
  start index 0
  find page 1 towrite
  lock Page 1
  ->writepage
page 1 writeback
page 1 clean
page 1 added to bio
  find page 2 towrite
  lock Page 2
  page 2 is writeback
  
write()
locks page 1
dirties page 1
fsync()

xfs_vm_writepages
write_cache_pages
  start index 0

!done && !cycled
  sets index to 0, restarts lookup
find page 1 dirty
  find page 1 towrite
  lock Page 1
  page 1 is writeback
  

lock Page 1


DEADLOCK because:

- process 1 needs page 2 writeback to complete to make
  enough progress to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2
  can progress and unlock the page it is blocked on, then it
  can issue the IO pending for page 2
- process 2 can't make progress until process 1 issues IO
  for page 1

The underlying cause of the problem here is that range_cyclic
writeback is processing pages in descending index order as we
hold higher index pages in a structure controlled from above
write_cache_pages(). The write_cache_pages() caller needs to
be able to submit these pages for IO before write_cache_pages
restarts writeback at mapping index 0 to avoid wcp inverting the
page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no
private context held across write_cache_pages() - filesystems using
this infrastructure always submit pages in ->writepage immediately
and so there is no problem with range_cyclic going back to mapping
index 0.

However:
mpage_writepages() has a private bio context,
exofs_writepages() has page_collect
fuse_writepages() has fuse_fill_wb_data
nfs_writepages() has nfs_pageio_descriptor
xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under
writeback in their private structures until write_cache_pages()
returns, and hence they are all susceptible to this deadlock.

Also worth noting is that ext4 has it's own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock. I
looked at the code long enough to understand that it has a similar
retry loop for range_cyclic writeback reaching the end of the file
and then promptly ran away before my eyes bled too much.  I'll leave
it for the ext4 developers to determine if their code is actually
has this deadlock and how to fix it if it has.

There's a few ways I can see avoid this deadlock. There's probably
more, but these are the first I've though of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wpc again to continue. i.e.
lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.

I don't think #1 is an option - 

[PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-04 Thread Dave Chinner
From: Dave Chinner 

We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process
application doing concurrent writes and fsyncs to a small range of a
file.

range_cyclic
writeback   Process 1   Process 2

xfs_vm_writepages
  write_cache_pages
writeback_index = 2
cycled = 0

find page 2 dirty
lock Page 2
->writepage
  page 2 writeback
  page 2 clean
  page 2 added to bio
no more pages
write()
locks page 1
dirties page 1
locks page 2
dirties page 1
fsync()

xfs_vm_writepages
write_cache_pages
  start index 0
  find page 1 towrite
  lock Page 1
  ->writepage
page 1 writeback
page 1 clean
page 1 added to bio
  find page 2 towrite
  lock Page 2
  page 2 is writeback
  
write()
locks page 1
dirties page 1
fsync()

xfs_vm_writepages
write_cache_pages
  start index 0

!done && !cycled
  sets index to 0, restarts lookup
find page 1 dirty
  find page 1 towrite
  lock Page 1
  page 1 is writeback
  

lock Page 1


DEADLOCK because:

- process 1 needs page 2 writeback to complete to make
  enough progress to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2
  can progress and unlock the page it is blocked on, then it
  can issue the IO pending for page 2
- process 2 can't make progress until process 1 issues IO
  for page 1

The underlying cause of the problem here is that range_cyclic
writeback is processing pages in descending index order as we
hold higher index pages in a structure controlled from above
write_cache_pages(). The write_cache_pages() caller needs to
be able to submit these pages for IO before write_cache_pages
restarts writeback at mapping index 0 to avoid wcp inverting the
page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no
private context held across write_cache_pages() - filesystems using
this infrastructure always submit pages in ->writepage immediately
and so there is no problem with range_cyclic going back to mapping
index 0.

However:
mpage_writepages() has a private bio context,
exofs_writepages() has page_collect
fuse_writepages() has fuse_fill_wb_data
nfs_writepages() has nfs_pageio_descriptor
xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under
writeback in their private structures until write_cache_pages()
returns, and hence they are all susceptible to this deadlock.

Also worth noting is that ext4 has it's own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock. I
looked at the code long enough to understand that it has a similar
retry loop for range_cyclic writeback reaching the end of the file
and then promptly ran away before my eyes bled too much.  I'll leave
it for the ext4 developers to determine if their code is actually
has this deadlock and how to fix it if it has.

There's a few ways I can see avoid this deadlock. There's probably
more, but these are the first I've though of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wpc again to continue. i.e.
lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.

I don't think #1 is an option -