Re: [PATCH] dax: fix deadlock in __dax_fault

2015-10-02 Thread Jan Kara
On Tue 29-09-15 20:04:21, Ross Zwisler wrote:
> On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> <>
> > Already testing a kernel with those reverted. My current DAX patch
> > stack is (bottom is first commit in stack):
> > 
> > f672ae4 xfs: add ->pfn_mkwrite support for DAX
> > 6855c23 xfs: remove DAX complete_unwritten callback
> > e074bdf Revert "dax: fix race between simultaneous faults"
> > 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> > a2ce6a5 xfs: DAX does not use IO completion callbacks
> > 246c52a xfs: update size during allocation for DAX
> > 9d10e7b xfs: Don't use unwritten extents for DAX
> > eaef807 xfs: factor out sector mapping.
> > e7f2d50 xfs: introduce per-inode DAX enablement
> 
> Dave, would you be willing to share these patches with me, even if they are
> just RFC?  I'm working through how to add equivalent support in both ext2 and
> ext4, and a conceptual example in XFS would be really helpful.
> 
> Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
> bits, I certainly wouldn't say no.  :)

I'm sorry for being slow, but I was on vacation or travelling to
conferences for the whole of September (still at a conference now ;), so I'm
mostly just catching up with what's going on... But I can help with making
the necessary changes to ext4 to make DAX reliable there.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-29 Thread Dave Chinner
On Tue, Sep 29, 2015 at 08:04:21PM -0600, Ross Zwisler wrote:
> On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> <>
> > Already testing a kernel with those reverted. My current DAX patch
> > stack is (bottom is first commit in stack):
> > 
> > f672ae4 xfs: add ->pfn_mkwrite support for DAX
> > 6855c23 xfs: remove DAX complete_unwritten callback
> > e074bdf Revert "dax: fix race between simultaneous faults"
> > 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> > a2ce6a5 xfs: DAX does not use IO completion callbacks
> > 246c52a xfs: update size during allocation for DAX
> > 9d10e7b xfs: Don't use unwritten extents for DAX
> > eaef807 xfs: factor out sector mapping.
> > e7f2d50 xfs: introduce per-inode DAX enablement
> 
> Dave, would you be willing to share these patches with me, even if they are
> just RFC?  I'm working through how to add equivalent support in both ext2 and
> ext4, and a conceptual example in XFS would be really helpful.

When I have code that works properly and isn't fundamentally broken,
I'll post it. Hopefully within the next day.

> Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
> bits, I certainly wouldn't say no.  :)

ext2 already does the block zeroing in allocation, so it only needs
locking help...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-29 Thread Ross Zwisler
On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
<>
> Already testing a kernel with those reverted. My current DAX patch
> stack is (bottom is first commit in stack):
> 
> f672ae4 xfs: add ->pfn_mkwrite support for DAX
> 6855c23 xfs: remove DAX complete_unwritten callback
> e074bdf Revert "dax: fix race between simultaneous faults"
> 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> a2ce6a5 xfs: DAX does not use IO completion callbacks
> 246c52a xfs: update size during allocation for DAX
> 9d10e7b xfs: Don't use unwritten extents for DAX
> eaef807 xfs: factor out sector mapping.
> e7f2d50 xfs: introduce per-inode DAX enablement

Dave, would you be willing to share these patches with me, even if they are
just RFC?  I'm working through how to add equivalent support in both ext2 and
ext4, and a conceptual example in XFS would be really helpful.

Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
bits, I certainly wouldn't say no.  :)


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-29 Thread Dave Chinner
On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> > > > 4) Test all changes with xfstests using both xfs & ext4, using lockdep.
> > > > 
> > > > Did I miss any issues, or does this path not solve one of them somehow?
> > > > 
> > > > Does this sound like a reasonable path forward for v4.3?  Dave, and 
> > > > Jan, can
> > > > you guys can provide guidance and code reviews for the XFS and ext4 
> > > > bits?
> > > 
> > > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > > the bad changes in 4.3, and then work towards fixing this for the
> > > 4.4 merge window. If someone needs this for 4.3, then they can
> > > backport the 4.4 code to 4.3-stable.
> > > 
> > > The "fast and loose and fix it later" development model does not
> > > work for persistent storage algorithms; DAX is storage - not memory
> > > management - and so we need to treat it as such.
> > 
> > Okay.  To get our locking back to v4.2 levels here are the two commits I 
> > think
> > we need to look at:
> > 
> > commit 843172978bb9 ("dax: fix race between simultaneous faults")
> > commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for 
> > DAX")
> 
> Already testing a kernel with those reverted. My current DAX patch
> stack is (bottom is first commit in stack):
> 

And just to indicate why 4.3 is completely unrealistic, let me give
you a summary of this patchset so far:

> f672ae4 xfs: add ->pfn_mkwrite support for DAX

I *think* it works.

> 6855c23 xfs: remove DAX complete_unwritten callback

Gone.

> e074bdf Revert "dax: fix race between simultaneous faults"
> 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> a2ce6a5 xfs: DAX does not use IO completion callbacks

DAX still needs to use IO completion callbacks for the DIO path, so
needed rewriting. Made 6855c23 redundant.

> 246c52a xfs: update size during allocation for DAX

Fundamentally broken, so removed. DIO passes the actual size from IO
completion, not into block allocation, hence DIO still needs
completion callbacks. DAX page faults can't change
the file size (they should segv before we get here), so we need to
specifically handle that to avoid leaking ioend structures due to
incorrect detection of EOF updates caused by overflows...

> 9d10e7b xfs: Don't use unwritten extents for DAX

Exposed a behaviour in DIO and DAX that results in s64 variable
overflow when writing to the block at file offset (2^63 - 1FSB).
Both the DAX and DIO code ask for a mapping at:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096

which gives a size of 0x8000000000000000 (larger than
sb->s_maxbytes!) and results in a sign overflow when checking if an
inode size update is required.  Direct IO avoids this overflow
because the logic checks for unwritten extents first, and it is the
IO completion callback that has the correct size. Removing unwritten
extent allocation from DAX exposed this bug by firing asserts all
through the XFS block mapping and IO completion callbacks.

Fixed the overflow, testing got further and then fsx exposed another
problem similar to the size update issue above. Patch is
fundamentally broken: block zeroing needs to be driven all the way
into the low level allocator implementation to fix the problems fsx
exposed.

> eaef807 xfs: factor out sector mapping.

Probably not going to be used now.

So, basically, I've rewritten most of the patch set once, and I'm
about to fundamentally change it again to address problems the first
two versions have exposed.  Hopefully this will show you the
complexity of what we are dealing with here, and why I said this
needs to go through 4.4?

It should also help explain why I suggested that if ext4 developers
aren't interested in fixing DAX problems then we should just drop
ext4 DAX support? Making this stuff work correctly requires more
than just a cursory knowledge of a filesystem, and nobody actively
working on DAX has the qualifications to make these sorts of changes
to ext4...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dave Chinner
On Mon, Sep 28, 2015 at 08:08:13PM -0700, Dan Williams wrote:
> On Mon, Sep 28, 2015 at 7:18 PM, Dave Chinner  wrote:
> > On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
> > I'm concerned with making what we have work before we go and change
> > everything. You might want to move really quickly, but without sane
> > filesystem support you can't ship anything worth a damn. There's all
> > sorts of issues here, and introducing struct pages doesn't solve all
> > of them.
> >
> > Let's concentrate on ensuring the basic operation of DAX is robust
> > first - get the page fault vs extent manipulations serialised, sane
> > and scalable before we start changing anything else. If we don't
> > solve these problems, then nothing else we do will be reliable, and
> > the problems exist regardless of whether we are using struct pages
> > or not. Hence these are the critical problems we need to fix before
> > anything else.
> >
> > Once we have these issues sorted out, switching between struct page
> > and pfn should be much simpler because we don't have to worry about
> > different locking strategies to protect against truncate, racing
> > page faults, etc.
> 
> It sounds like you have a page-independent/scalable method in mind for
> solving the truncate protection problem?

In mind? It was added to XFS back in 4.1 by commit 653c60b ("xfs:
introduce mmap/truncate lock").

> I had always thought that
> must require struct page, but if you're happy to carry that solution
> in the filesystem you're not going to see resistance from me.

The page based "truncate serialisation" solution in Linux has always
been a horrible, nasty hack. It only works because we always modify
the inode size before invalidating pages in the page cache. Hence we
can check once we have the page lock to see if the page is beyond
EOF. This EOF update hack avoids the need for a filesystem level
lock to guarantee multi-page truncate invalidation atomicity.
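
That hack can be sketched roughly like this (simplified pseudocode in
kernel style, not any filesystem's literal fault path):

```c
/* Because truncate updates i_size *before* invalidating pages, a fault
 * handler can detect the race once it holds the page lock: */
lock_page(page);
if (page->mapping != inode->i_mapping ||
    page->index >= ((i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT)) {
	/* raced with truncate: page is orphaned or beyond the new EOF */
	unlock_page(page);
	return VM_FAULT_NOPAGE;		/* back out and retry the fault */
}
/* safe to proceed: truncate cannot invalidate this page while we
 * hold its lock */
```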

This, however, does not work for operations which require atomic
multi-page invalidation within EOF over multiple long running
operations. The page is not beyond EOF, so we have no way of
determining from the page that we are racing with an invalidation.
This affects operations such as hole punching, extent shifting
up/down, swapping extents between different inodes, etc.

These *require* the filesystem to be able to lock out page faults
for the duration of the manipulation operation, as the extent
manipulations may not be done atomically from the perspective of
userspace driven page faults. They *are atomic* from the perspective
of the read/write syscall interfaces thanks to the XFS_IOLOCK_*
locking (usually the i_mutex provides this in other filesystems).

The XFS_MMAPLOCK_[SHARED|EXCL] locking was added to provide this
multi-page invalidation  vs page fault serialisation to XFS. It
works completely independently of any given struct page in the file,
and so does not rely on DAX to use struct page to serialise
correctly.  I've solved this problem in XFS because a generic
solution was not happening. Any filesystem that supports hole
punching and other fallocate-based manipulations needs similar
locking to be safe against page fault based /data corruption/ races
in these operations.
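
A rough sketch of that locking pattern (abridged; the real XFS code
carries more context around these calls):

```c
/* Page fault path: take the mmap lock shared around the fault. */
xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

/* Extent manipulation (hole punch, shift, swap): take it exclusive,
 * so no page fault can run for the duration of the whole operation. */
xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
truncate_pagecache_range(inode, start, end);
error = xfs_free_file_space(ip, offset, len);	/* e.g. punch the hole */
xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
```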

FWIW, direct IO also needs to exclude page faults for proper mmap
coherency, but that's a much harder problem to solve because direct
IO takes the mmap_sem below the filesystem, and hence we can't hold
any locks over DIO submission that might be required in a page fault
within get_user_pages().

> >> I do not think introducing page-backed persistent memory sets us back to
> >> square 1.  Instead, given the functionality that is enabled when pages
> >> are present I think it is safe to assume most platforms will arrange
> >> for page backed persistent memory.
> >
> > Sure, but it will take a little time to get there. Moving fast
> > doesn't help us here - it only results in stuff we have to revert or
> > redo in the near future and that means progress is much slower than
> > it should be. Let's solve the DAX problems in the right order - it
> > will make things simpler and faster down the road.
> 
> Sounds workable, although this thread is missing an ext4
> representative so far.  Hopefully ext4 is equally open to solving
> these problems generically without struct page.

It appears to me that none of the ext4 developers have shown any
interest in fixing the problems reported with DAX. Some of those
problems existed before DAX came along (e.g. unwritten extent
conversion issues that can cause data corruption) but I've seen no
interest in fixing them, either.  Hence I'm quite happy to drop ext4
DAX support until an ext4 dev steps up and fixes the issues that
need fixing...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dan Williams
On Mon, Sep 28, 2015 at 7:18 PM, Dave Chinner  wrote:
> On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
>> On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner  wrote:
>> > On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
>> >> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner  wrote:
>> >> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
>> >> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
>> >> [..]
>> >> >> Does this sound like a reasonable path forward for v4.3?  Dave, and 
>> >> >> Jan, can
>> >> >> you guys can provide guidance and code reviews for the XFS and ext4 
>> >> >> bits?
>> >> >
>> >> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
>> >> > the bad changes in 4.3, and then work towards fixing this for the
>> >> > 4.4 merge window. If someone needs this for 4.3, then they can
>> >> > backport the 4.4 code to 4.3-stable.
>> >> >
>> >>
>> >> If the proposal is to step back and get a running start at these fixes
>> >> for 4.4, then it is worth considering what the state of allocating
>> >> pages for DAX mappings will be in 4.4.
>> >
>> > Oh, do tell. I haven't seen any published design, code, etc,
>>
>> This is via the devm_memremap_pages() api that went into 4.2 [1] and
>> my v1 (RFC quality) series using it for dax get_user_pages() [2].
>>
>> [1]: https://lkml.org/lkml/2015/8/25/841
>> [2]: https://lkml.org/lkml/2015/9/23/11
>
> I'll have a look at some point when I'm not trying to put out fires.
>
>> > And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
>> > in XFS until I've had time to review, test and fix it so it works
>> > without deadlocking or corrupting data.
>>
>> I'm in violent agreement, to the point where I'm pondering whether
>> CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've
>> convinced ourselves of all the fixes in 4.4.  It's not clear to me
>> that we have a stable baseline to which we can revert this "still in
>> development" implementation, did you have one in mind?
>
> XFS warns that DAX is experimental when you mount with that option,
> so there is no need to do that:
>
> [  686.055780] XFS (ram0): DAX enabled. Warning: EXPERIMENTAL, use at your 
> own risk
> [  686.058464] XFS (ram0): Mounting V5 Filesystem
> [  686.062857] XFS (ram0): Ending clean mount

Well that is comforting, although a similar warning is missing from
ext4.  I'll send a patch.

> >> It's already the case that
>> >> allocating struct page for DAX mappings is the only solution on the
>> >> horizon for enabling a get_user_pages() solution for persistent
>> >> memory.  We of course need to get the page-less DAX path fixed up, but
>> >> the near-term path to full functionality and safety is when struct
>> >> page is available to enable the typical synchronization mechanics.
>> >
>> > And we do so at the expense of medium to long term complexity and
>> > maintenance. I'm no fan of using struct pages to track terabytes to
>> > petabytes of persistent memory, and I'm even less of a fan of having
>> > to simultaneously support both struct page and pfn based DAX
>> > subsystems...
>>
>> I'm no fan of tracking petabytes of persistent memory with struct
>> page, but we're in the near term space (hardware technology-wise) of
>> how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of
>> persistent memory.
>
> Don't think I don't know that - as I said to someone a few hours
> ago on IRC:
>
> [29/09/15 07:41]  I'm sure they do, but they have a hard 
> requirement to support RDMA from persistent memory
> [29/09/15 07:41]  and that's what seems to be driving the "we need 
> to use struct pages" design

Fair enough...

>> A page-less solution to that problem is not on the
>> horizon as far as I can tell.  In short, I am concerned we are
>> spending time working around the lack of struct page to get to a
>> stable page-less solution that is still missing support for the use
>> cases that are expected to "just work".
>
> I'm concerned with making what we have work before we go and change
> everything. You might want to move really quickly, but without sane
> filesystem support you can't ship anything worth a damn. There's all
> sorts of issues here, and introducing struct pages doesn't solve all
> of them.
>
> Let's concentrate on ensuring the basic operation of DAX is robust
> first - get the page fault vs extent manipulations serialised, sane
> and scalable before we start changing anything else. If we don't
> solve these problems, then nothing else we do will be reliable, and
> the problems exist regardless of whether we are using struct pages
> or not. Hence these are the critical problems we need to fix before
> anything else.
>
> Once we have these issues sorted out, switching between struct page
> and pfn should be much simpler because we don't have to worry about
> different locking strategies to protect against truncate, racing
> page faults, etc.

It sounds like you have a page-independent/scalable method in mind for
solving the truncate protection problem?

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dave Chinner
On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> <>
> > In reality, the required DAX page fault vs truncate serialisation is
> > provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
> > the fault, mkwrite and filesystem extent manipulation paths. There
> > is no way this sort of exclusion can be done in the mm/ subsystem as
> > it simply does not have the context to be able to provide the
> > necessary serialisation.  Every filesystem that wants to use DAX
> > needs to provide this external page fault serialisation, and in
> > doing so will also protect it's hole punch/extent swap/shift
> > operations under normal operation against racing mmap access
> > 
> > IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.
> 
> So is it your belief that XFS already has correct locking in place to ensure
> that we don't hit these races?  I see XFS taking XFS_MMAPLOCK_SHARED before it
> calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and
> in xfs_filemap_fault().
> 
> XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't
> found the generic hole punching code yet, but I assume it takes
> XFS_MMAPLOCK_EXCL as well.

There is no generic hole punching. See xfs_file_fallocate(), where
most fallocate() based operations are protected, xfs_ioc_space()
where all the XFS ioctl based extent manipulations are protected,
xfs_swap_extents() where online defrag extent swaps are protected.
And we'll add it to any future operations that directly
manipulate extent mappings. 

> Meaning, is the work that we need to do around extent vs page fault locking
> basically adding equivalent locking to ext4 and ext2 and removing the attempts
> at locking from dax.c?

Yup. I'm not game to touch ext4 locking, though.

> 
> > > 4) Test all changes with xfstests using both xfs & ext4, using lockdep.
> > > 
> > > Did I miss any issues, or does this path not solve one of them somehow?
> > > 
> > > Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, 
> > > can
> > > you guys can provide guidance and code reviews for the XFS and ext4 bits?
> > 
> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > the bad changes in 4.3, and then work towards fixing this for the
> > 4.4 merge window. If someone needs this for 4.3, then they can
> > backport the 4.4 code to 4.3-stable.
> > 
> > The "fast and loose and fix it later" development model does not
> > work for persistent storage algorithms; DAX is storage - not memory
> > management - and so we need to treat it as such.
> 
> Okay.  To get our locking back to v4.2 levels here are the two commits I think
> we need to look at:
> 
> commit 843172978bb9 ("dax: fix race between simultaneous faults")
> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

Already testing a kernel with those reverted. My current DAX patch
stack is (bottom is first commit in stack):

f672ae4 xfs: add ->pfn_mkwrite support for DAX
6855c23 xfs: remove DAX complete_unwritten callback
e074bdf Revert "dax: fix race between simultaneous faults"
8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
a2ce6a5 xfs: DAX does not use IO completion callbacks
246c52a xfs: update size during allocation for DAX
9d10e7b xfs: Don't use unwritten extents for DAX
eaef807 xfs: factor out sector mapping.
e7f2d50 xfs: introduce per-inode DAX enablement

BTW, to add to the problems that need fixing: the pfn_mkwrite
code needs to check that the fault is still within EOF, like
__dax_fault does. i.e. the top patch in the series adds this
to xfs_filemap_pfn_mkwrite() instead of using dax_pfn_mkwrite():


+   /* check if the faulting page hasn't raced with truncate */
+   xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+   size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   if (vmf->pgoff >= size)
+   ret = VM_FAULT_SIGBUS;
+   xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+   sb_end_pagefault(inode->i_sb);

i.e. dax_pfn_mkwrite() doesn't work correctly when racing with
truncate (i.e. another way that ext2/ext4 are currently broken).

> On an unrelated note, while wandering through the XFS code I found the
> following lock ordering documented above xfs_ilock():
> 
>  * Basic locking order:
>  *
>  * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
>  *
>  * mmap_sem locking order:
>  *
>  * i_iolock -> page lock -> mmap_sem
>  * mmap_sem -> i_mmap_lock -> page_lock
> 
> I noticed that page_lock and i_mmap_lock are in different places in the
> ordering depending on the presence or absence of mmap_sem.  Does this not open
> us up to a lock ordering inversion?

Typo, not picked up in review (note the missing "_").

- * i_iolock -> page lock -> mmap_sem
+ * i_iolock -> page fault -> mmap_sem

Thanks for the heads-up on that.

Cheers,

Dave.

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dave Chinner
On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
> On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner  wrote:
> > On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
> >> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner  wrote:
> >> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> >> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> >> [..]
> >> >> Does this sound like a reasonable path forward for v4.3?  Dave, and 
> >> >> Jan, can
> >> >> you guys can provide guidance and code reviews for the XFS and ext4 
> >> >> bits?
> >> >
> >> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> >> > the bad changes in 4.3, and then work towards fixing this for the
> >> > 4.4 merge window. If someone needs this for 4.3, then they can
> >> > backport the 4.4 code to 4.3-stable.
> >> >
> >>
> >> If the proposal is to step back and get a running start at these fixes
> >> for 4.4, then it is worth considering what the state of allocating
> >> pages for DAX mappings will be in 4.4.
> >
> > Oh, do tell. I haven't seen any published design, code, etc,
> 
> This is via the devm_memremap_pages() api that went into 4.2 [1] and
> my v1 (RFC quality) series using it for dax get_user_pages() [2].
> 
> [1]: https://lkml.org/lkml/2015/8/25/841
> [2]: https://lkml.org/lkml/2015/9/23/11

I'll have a look at some point when I'm not trying to put out fires.

> > And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
> > in XFS until I've had time to review, test and fix it so it works
> > without deadlocking or corrupting data.
> 
> I'm in violent agreement, to the point where I'm pondering whether
> CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've
> convinced ourselves of all the fixes in 4.4.  It's not clear to me
> that we have a stable baseline to which we can revert this "still in
> development" implementation, did you have one in mind?

XFS warns that DAX is experimental when you mount with that option,
so there is no need to do that:

[  686.055780] XFS (ram0): DAX enabled. Warning: EXPERIMENTAL, use at your own 
risk
[  686.058464] XFS (ram0): Mounting V5 Filesystem
[  686.062857] XFS (ram0): Ending clean mount

> >> It's already the case that
> >> allocating struct page for DAX mappings is the only solution on the
> >> horizon for enabling a get_user_pages() solution for persistent
> >> memory.  We of course need to get the page-less DAX path fixed up, but
> >> the near-term path to full functionality and safety is when struct
> >> page is available to enable the typical synchronization mechanics.
> >
> > And we do so at the expense of medium to long term complexity and
> > maintenance. I'm no fan of using struct pages to track terabytes to
> > petabytes of persistent memory, and I'm even less of a fan of having
> > to simultaneously support both struct page and pfn based DAX
> > subsystems...
> 
> I'm no fan of tracking petabytes of persistent memory with struct
> page, but we're in the near term space (hardware technology-wise) of
> how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of
> persistent memory.

Don't think I don't know that - as I said to someone a few hours
ago on IRC:

[29/09/15 07:41]  I'm sure they do, but they have a hard requirement 
to support RDMA from persistent memory
[29/09/15 07:41]  and that's what seems to be driving the "we need to 
use struct pages" design

> A page-less solution to that problem is not on the
> horizon as far as I can tell.  In short, I am concerned we are
> spending time working around the lack of struct page to get to a
> stable page-less solution that is still missing support for the use
> cases that are expected to "just work".

I'm concerned with making what we have work before we go and change
everything. You might want to move really quickly, but without sane
filesystem support you can't ship anything worth a damn. There's all
sorts of issues here, and introducing struct pages doesn't solve all
of them.

Let's concentrate on ensuring the basic operation of DAX is robust
first - get the page fault vs extent manipulations serialised, sane
and scalable before we start changing anything else. If we don't
solve these problems, then nothing else we do will be reliable, and
the problems exist regardless of whether we are using struct pages
or not. Hence these are the critical problems we need to fix before
anything else.

Once we have these issues sorted out, switching between struct page
and pfn should be much simpler because we don't have to worry about
different locking strategies to protect against truncate, racing
page faults, etc.

> I do not think introducing page-backed persistent memory sets us back to
> square 1.  Instead, given the functionality that is enabled when pages
> are present I think it is safe to assume most platforms will arrange
> for page backed persistent memory.

Sure, but it will take a little time to get there. Moving fast
doesn't help us here - it only results in stuff we have to revert or
redo in the near future and that means progress is much slower than
it should be. Let's solve the DAX problems in the right order - it
will make things simpler and faster down the road.

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dan Williams
On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner  wrote:
> On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
>> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner  wrote:
>> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
>> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
>> [..]
>> >> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, 
>> >> can
>> >> you guys can provide guidance and code reviews for the XFS and ext4 bits?
>> >
>> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
>> > the bad changes in 4.3, and then work towards fixing this for the
>> > 4.4 merge window. If someone needs this for 4.3, then they can
>> > backport the 4.4 code to 4.3-stable.
>> >
>>
>> If the proposal is to step back and get a running start at these fixes
>> for 4.4, then it is worth considering what the state of allocating
>> pages for DAX mappings will be in 4.4.
>
> Oh, do tell. I haven't seen any published design, code, etc,

This is via the devm_memremap_pages() api that went into 4.2 [1] and
my v1 (RFC quality) series using it for dax get_user_pages() [2].

[1]: https://lkml.org/lkml/2015/8/25/841
[2]: https://lkml.org/lkml/2015/9/23/11

> and I certainly haven't planned any time in the 4.4 window to do a
> complete audit, rework and test of the XFS DAX code. So if you want
> a working DAX implementation in the short term, we need to fix what
> we have and not do wholesale changes to infrastructure that put us
> back to square 1.

Yes, as Ross educated me, the current split of what is handled in the
filesystem vs what is handled in __dax_fault() potentially makes the
availability of struct page moot because the locking does not work if
initiated from within fs/dax.c...

> And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
> in XFS until I've had time to review, test and fix it so it works
> without deadlocking or corrupting data.

I'm in violent agreement, to the point where I'm pondering whether
CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've
convinced ourselves of all the fixes in 4.4.  It's not clear to me
that we have a stable baseline to which we can revert this "still in
development" implementation, did you have one in mind?

>> It's already the case that
>> allocating struct page for DAX mappings is the only solution on the
>> horizon for enabling a get_user_pages() solution for persistent
>> memory.  We of course need to get the page-less DAX path fixed up, but
>> the near-term path to full functionality and safety is when struct
>> page is available to enable the typical synchronization mechanics.
>
> And we do so at the expense of medium to long term complexity and
> maintenance. I'm no fan of using struct pages to track terabytes to
> petabytes of persistent memory, and I'm even less of a fan of having
> to simultaneously support both struct page and pfn based DAX
> subsystems...

I'm no fan of tracking petabytes of persistent memory with struct
page, but we're in the near term space (hardware technology-wise) of
how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of
persistent memory.  A page-less solution to that problem is not on the
horizon as far as I can tell.  In short, I am concerned we are
spending time working around the lack of struct page to get to a
stable page-less solution that is still missing support for the use
cases that are expected to "just work".

I do not think introducing page-backed persistent memory sets us back to
square 1.  Instead, given the functionality that is enabled when pages
are present I think it is safe to assume most platforms will arrange
for page backed persistent memory.  If the page-less case is rare to
non-existent then we should design for the page-backed case at least
until the "petabytes of persistent memory" era arrives.  I think we
have plenty of time to get page-less right before it is needed, but we
have to get over the roadblocks that Christoph and I hit even trying
to convert the DMA-API over to be pfn based [3].

[3]: https://lkml.org/lkml/2015/8/12/682
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Ross Zwisler
On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
<>
> In reality, the required DAX page fault vs truncate serialisation is
> provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
> the fault, mkwrite and filesystem extent manipulation paths. There
> is no way this sort of exclusion can be done in the mm/ subsystem as
> it simply does not have the context to be able to provide the
> necessary serialisation.  Every filesystem that wants to use DAX
> needs to provide this external page fault serialisation, and in
> doing so will also protect its hole punch/extent swap/shift
> operations under normal operation against racing mmap access
> 
> IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.

So is it your belief that XFS already has correct locking in place to ensure
that we don't hit these races?  I see XFS taking XFS_MMAPLOCK_SHARED before it
calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and
in xfs_filemap_fault().

XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't
found the generic hole punching code yet, but I assume it takes
XFS_MMAPLOCK_EXCL as well.

Meaning, is the work that we need to do around extent vs page fault locking
basically adding equivalent locking to ext4 and ext2 and removing the attempts
at locking from dax.c?

> > 4) Test all changes with xfstests using both xfs & ext4, using lockdep.
> > 
> > Did I miss any issues, or does this path not solve one of them somehow?
> > 
> > Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> > you guys can provide guidance and code reviews for the XFS and ext4 bits?
> 
> IMO, it's way too much to get into 4.3. I'd much prefer we revert
> the bad changes in 4.3, and then work towards fixing this for the
> 4.4 merge window. If someone needs this for 4.3, then they can
> backport the 4.4 code to 4.3-stable.
> 
> The "fast and loose and fix it later" development model does not
> work for persistent storage algorithms; DAX is storage - not memory
> management - and so we need to treat it as such.

Okay.  To get our locking back to v4.2 levels here are the two commits I think
we need to look at:

commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

The former is the one that introduced the heavy reliance on write locks of the
mmap semaphore and introduced the various deadlocks that we've found.  The
latter moved some of that locking around so we didn't hold a write lock on the
mmap semaphore during unmap_mapping_range().

Does this sound correct to you?

On an unrelated note, while wandering through the XFS code I found the
following lock ordering documented above xfs_ilock():

 * Basic locking order:
 *
 * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
 *
 * mmap_sem locking order:
 *
 * i_iolock -> page lock -> mmap_sem
 * mmap_sem -> i_mmap_lock -> page_lock

I noticed that page_lock and i_mmap_lock are in different places in the
ordering depending on the presence or absence of mmap_sem.  Does this not open
us up to a lock ordering inversion?

Thread 1 (mmap_sem)         Thread 2 (no mmap_sem)
---                         --
page_lock
mmap_sem
                            i_mmap_lock
i_mmap_lock                 page_lock



Thanks,
- Ross


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dave Chinner
On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner  wrote:
> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> [..]
> >> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, 
> >> can
> >> you guys can provide guidance and code reviews for the XFS and ext4 bits?
> >
> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > the bad changes in 4.3, and then work towards fixing this for the
> > 4.4 merge window. If someone needs this for 4.3, then they can
> > backport the 4.4 code to 4.3-stable.
> >
> 
> If the proposal is to step back and get a running start at these fixes
> for 4.4, then it is worth considering what the state of allocating
> pages for DAX mappings will be in 4.4.

Oh, do tell. I haven't seen any published design, code, etc, and I
certainly haven't planned any time in the 4.4 window to do a
complete audit, rework and test of the XFS DAX code. So if you want
a working DAX implementation in the short term, we need to fix what
we have and not do wholesale changes to infrastructure that put us
back to square 1.

And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
in XFS until I've had time to review, test and fix it so it works
without deadlocking or corrupting data.

> It's already the case that
> allocating struct page for DAX mappings is the only solution on the
> horizon for enabling a get_user_pages() solution for persistent
> memory.  We of course need to get the page-less DAX path fixed up, but
> the near-term path to full functionality and safety is when struct
> page is available to enable the typical synchronization mechanics.

And we do so at the expense of medium to long term complexity and
maintenance. I'm no fan of using struct pages to track terabytes to
petabytes of persistent memory, and I'm even less of a fan of having
to simultaneously support both struct page and pfn based DAX
subsystems...

> That said, it's not clear that saves us any work given the axonram and
> dcssblk DAX drivers do not currently allocate struct page, and pmem
> optionally allocates struct page.

Precisely my point.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dan Williams
On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner  wrote:
> On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
>> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
[..]
>> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
>> you guys can provide guidance and code reviews for the XFS and ext4 bits?
>
> IMO, it's way too much to get into 4.3. I'd much prefer we revert
> the bad changes in 4.3, and then work towards fixing this for the
> 4.4 merge window. If someone needs this for 4.3, then they can
> backport the 4.4 code to 4.3-stable.
>

If the proposal is to step back and get a running start at these fixes
for 4.4, then it is worth considering what the state of allocating
pages for DAX mappings will be in 4.4.  It's already that case that
allocating struct page for DAX mappings is the only solution on the
horizon for enabling a get_user_pages() solution for persistent
memory.  We of course need to get the page-less DAX path fixed up, but
the near-term path to full functionality and safety is when struct
page is available to enable the typical synchronization mechanics.

That said, it's not clear that saves us any work given the axonram and
dcssblk DAX drivers do not currently allocate struct page, and pmem
optionally allocates struct page.


Re: Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread kbuild test robot
Hi Dave,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please 
ignore]

config: xtensa-allyesconfig (attached as .config)
reproduce:
  wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
  chmod +x ~/bin/make.cross
  git checkout 285753fa883fcbeac6b393da338b6e976af57912
  # save the attached .config to linux build tree
  make.cross ARCH=xtensa 

All warnings (new ones prefixed by >>):

   fs/xfs/xfs_iomap.c: In function 'xfs_iomap_write_direct':
>> fs/xfs/xfs_iomap.c:247:5: warning: format '%ld' expects argument of type 'long int', but argument 5 has type 'size_t' [-Wformat=]
     imap->br_blockcount);
     ^
>> fs/xfs/xfs_iomap.c:247:5: warning: format '%ld' expects argument of type 'long int', but argument 6 has type 'sector_t' [-Wformat=]

vim +247 fs/xfs/xfs_iomap.c

   231		/* DAX needs to zero the entire allocated extent here */
   232		if (IS_DAX(VFS_I(ip)) && nimaps) {
   233			sector_t sector = xfs_imap_to_sector(VFS_I(ip), imap, offset);
   234	
   235			ASSERT(!ISUNWRITTEN(imap));
   236			ASSERT(nimaps == 1);
   237			error = dax_clear_blocks(VFS_I(ip),
   238				sector >> (VFS_I(ip)->i_blkbits - BBSHIFT),
   239				XFS_FSB_TO_B(mp, imap->br_blockcount));
   240			if (error) {
   241				xfs_warn(mp,
   242	"err %d, off/cnt %lld/%ld, sector %ld, bytes %lld, im.stblk %lld, im.stoff %lld, im.blkcnt %lld",
   243				error, offset, count,
   244				xfs_imap_to_sector(VFS_I(ip), imap, offset),
   245				XFS_FSB_TO_B(mp, imap->br_blockcount),
   246				imap->br_startblock, imap->br_startoff,
 > 247				imap->br_blockcount);
   248				goto out_trans_cancel;
   249			}
   250		}
   251	
   252		error = xfs_trans_commit(tp);
   253		if (error)
   254			goto out_unlock;
   255	

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread kbuild test robot
Hi Dave,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please 
ignore]

config: mips-allyesconfig (attached as .config)
reproduce:
  wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
  chmod +x ~/bin/make.cross
  git checkout 285753fa883fcbeac6b393da338b6e976af57912
  # save the attached .config to linux build tree
  make.cross ARCH=mips 

All warnings (new ones prefixed by >>):

   fs/xfs/xfs_iomap.c: In function 'xfs_iomap_write_direct':
>> fs/xfs/xfs_iomap.c:242:2: warning: format '%ld' expects argument of type 'long int', but argument 5 has type 'size_t {aka unsigned int}' [-Wformat=]
  "err %d, off/cnt %lld/%ld, sector %ld, bytes %lld, im.stblk %lld, im.stoff %lld, im.blkcnt %lld",
  ^
>> fs/xfs/xfs_iomap.c:242:2: warning: format '%ld' expects argument of type 'long int', but argument 6 has type 'sector_t {aka long long unsigned int}' [-Wformat=]

vim +242 fs/xfs/xfs_iomap.c

   226	 */
   227		error = xfs_bmap_finish(&tp, &free_list, &committed);
   228		if (error)
   229			goto out_bmap_cancel;
   230	
   231		/* DAX needs to zero the entire allocated extent here */
   232		if (IS_DAX(VFS_I(ip)) && nimaps) {
   233			sector_t sector = xfs_imap_to_sector(VFS_I(ip), imap, offset);
   234	
   235			ASSERT(!ISUNWRITTEN(imap));
   236			ASSERT(nimaps == 1);
   237			error = dax_clear_blocks(VFS_I(ip),
   238				sector >> (VFS_I(ip)->i_blkbits - BBSHIFT),
   239				XFS_FSB_TO_B(mp, imap->br_blockcount));
   240			if (error) {
   241				xfs_warn(mp,
 > 242	"err %d, off/cnt %lld/%ld, sector %ld, bytes %lld, im.stblk %lld, im.stoff %lld, im.blkcnt %lld",
   243				error, offset, count,
   244				xfs_imap_to_sector(VFS_I(ip), imap, offset),
   245				XFS_FSB_TO_B(mp, imap->br_blockcount),
   246				imap->br_startblock, imap->br_startoff,
   247				imap->br_blockcount);
   248				goto out_trans_cancel;
   249			}
   250		}

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-28 Thread Dave Chinner
On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> > Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> > you guys can provide guidance and code reviews for the XFS and ext4 bits?
> 
> IMO, it's way too much to get into 4.3. I'd much prefer we revert
> the bad changes in 4.3, and then work towards fixing this for the
> 4.4 merge window. If someone needs this for 4.3, then they can
> backport the 4.4 code to 4.3-stable.

FWIW, here's the first bit of making XFS clear blocks during
allocation. There's a couple of things I need to fix (e.g. moving
the zeroing out of the transaction context but still under the inode
allocation lock and hence atomic with the allocation), it needs to
be split into at least two patches (to split out the
xfs_imap_to_sector() helper), and there's another patch to remove
the complete_unwritten callbacks. I also need to do more audit and
validation on the handling of unwritten extents during get_blocks()
for write faults - the get_blocks() call in this case effectively
becomes an unwritten extent conversion call rather than an
allocation call, and I need to validate that it does what get_block
expects it to do. And that I haven't broken any of the other
get_blocks callers. So still lots to do.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: Don't use unwritten extents for DAX

From: Dave Chinner 

DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, it can get two concurrent
faults to the page that race.

When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.

When two write faults occur, we serialise block allocation in
get_blocks() so only one fault will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it back to written.

The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the zeroing for the second
fault overwriting the data that the first fault wrote with zeros.
The second fault then attempts to convert the unwritten extent,
which is then a no-op because it's already written. Data loss occurs
as a result of this race.

Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occurs atomically in the
filesystem. This means we can still take concurrent page faults and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_aops.c  | 40 
 fs/xfs/xfs_aops.h  |  5 +
 fs/xfs/xfs_iomap.c | 38 +-
 3 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 50ab287..f645587 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -586,27 +586,35 @@ xfs_add_to_ioend(
ioend->io_size += bh->b_size;
 }
 
-STATIC void
-xfs_map_buffer(
+sector_t
+xfs_imap_to_sector(
 	struct inode		*inode,
-	struct buffer_head	*bh,
 	struct xfs_bmbt_irec	*imap,
 	xfs_off_t		offset)
 {
-	sector_t		bn;
-	struct xfs_mount	*m = XFS_I(inode)->i_mount;
-	xfs_off_t		iomap_offset = XFS_FSB_TO_B(m, imap->br_startoff);
-	xfs_daddr_t		iomap_bn = xfs_fsb_to_db(XFS_I(inode), imap->br_startblock);
+	struct xfs_mount	*mp = XFS_I(inode)->i_mount;
+	xfs_off_t		iomap_offset;
+	xfs_daddr_t		iomap_bn;
 
 	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
 	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
 
-	bn = (iomap_bn >> (inode->i_blkbits - BBSHIFT)) +
-	      ((offset - iomap_offset) >> inode->i_blkbits);
+	iomap_bn = xfs_fsb_to_db(XFS_I(inode), imap->br_startblock);
+	iomap_offset = XFS_FSB_TO_B(mp, imap->br_startoff);
 
-	ASSERT(bn || XFS_IS_REALTIME_INODE(XFS_I(inode)));
+	return iomap_bn + BTOBB(offset - iomap_offset);
+}
 
-	bh->b_blocknr = bn;
+STATIC void
+xfs_map_buffer(
+	struct inode		*inode,
+	struct buffer_head	*bh,
+	struct xfs_bmbt_irec	*imap,
+	xfs_off_t		offset)
+{
+	bh->b_blocknr = xfs_imap_to_sector(in

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-27 Thread Dave Chinner
On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> > I think a bunch of reverts are in order -the code is broken, has
> > stale data exposure problems and we're going to have to rip the code
> > out anyway because I don't think this is the right way to fix the
> > underlying problem ("If two threads write-fault on the same hole at
> > the same time...")

[...]

> I think that at the end of the day the locking in the two DAX fault paths is
> trying to protect us against two things:
> 
> 1) Simultaneous write-fault on the same hole at the same time.
> 2) Races between page faults and truncate.

1) is fixed by pushing block zeroing into get_blocks()

2) is fixed by the filesystem locking out page faults during
truncate. XFS already does this with it's XFS_MMAPLOCK_[SHARED|EXCL]
locking on page faults and truncate...

> To get them all written down in one place, here are all the issues and
> questions that I currently know about regarding the DAX fault paths as of
> v4.3-rc2:
> 
> 1) In __dax_pmd_fault() the block inside the first "if (buffer_unwritten(&bh)

[fail to set kaddr, patch in -mm]

> 2) In __dax_pmd_fault() we have this code:

[fail to release i_mmap_lock_write()]

> 3) In __dax_pmd_fault() we convert the unwritten extent to written

> 4) In __dax_fault() we can return through this path:

[incorrect release of i_mmap_lock_write()]

> 5) In __dax_fault() we rely on the presence of a page pointer to know whether

[fail to release i_mmap_lock_write() on racing read faults]

> 6) In __dax_fault() in the vmf->cow_page case we can end up exiting with
>status VM_FAULT_LOCKED but with mapping->i_mmap_rwsem held.  This is

[undocumented, unbalanced, but handled]

> 7) In both __dax_fault() and __dax_pmd_fault() we call the filesystem's
>get_block() and complete_unwritten() functions while holding
>mapping->i_mmap_rwsem.  There is a concern that this could potentially lead
>to a lock inversion, leading to a deadlock.

[more investigation needed]

> Here's how I think we can address the above issues (basically "do what Dave
> said"):
> 
> 1) For the simultaneous write-fault issue, move the responsibility for zeroing
>the extents and converting them to written to the filesystem as Dave

*nod*

> 2) For the remaining uses of mapping->i_mmap_rwsem in the fault paths that
>protect us against truncate, address any remaining code issues listed
>above, clearly document in the fault handlers the locking rules and the
>exceptions to those rules (cow_page + VM_FAULT_LOCKED).  Update any locking
>order changes like in mm/filemap.c as necessary.

Incorrect. Truncate is a special case of hole punching, and the mm
subsystem has long relied on a nasty hack to serialise page faults
against truncate. That is:

lock_page(page);
if (page beyond EOF or mapping is different) {
/* OMG FAIL FAIL FAIL */
}

This "page beyond EOF" does not work for operations like hole
punching or other direct filesystem extent manipulation operations
that occur within the current EOF and require mapping invalidation
and serialisation until after the extent manipulation is done.

IOWs, all of these operations require the filesystem to prevent new
page faults from occurring after the initial invalidation and until
the extent manipulation is completed.  This includes truncate - the
"page beyond EOF" hack is not necessary anymore, because truncate
has the same invalidation behaviour w.r.t. punching out mappings
within EOF...

And with DAX, we have no struct page to lock, so there really isn't
anything in the mm subsystem that can be used to ensure that a given
range of a file is not being otherwise manipulated in a manner that
races with a page fault. The i_mmap_rwsem is not a replacement for
the page lock

In reality, the required DAX page fault vs truncate serialisation is
provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
the fault, mkwrite and filesystem extent manipulation paths. There
is no way this sort of exclusion can be done in the mm/ subsystem as
it simply does not have the context to be able to provide the
necessary serialisation.  Every filesystem that wants to use DAX
needs to provide this external page fault serialisation, and in
doing so will also protect its hole punch/extent swap/shift
operations under normal operation against racing mmap access

IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.

> 3) For issue #7 above (lock inversion concerns between i_mmap_rwsem and the
>filesystem locking needed for get_block() and complete_unwritten()), I
>think that the complete_unwritten() calls in DAX will go away since we will
>be allocating, zeroing and converting extents to written all in the
>filesystem.  Similarly, I think that the calls to get_block() will no
>longer happen

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-25 Thread Ross Zwisler
On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> On Thu, Sep 24, 2015 at 09:50:29AM -0600, Ross Zwisler wrote:
> > On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > > > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > > > that was causing us to deadlock:
> > > > 
> > > > 1) enter __dax_fault()
> > > > 2) page = find_get_page() gives us a page, so skip
> > > > i_mmap_lock_write(mapping)
> > > > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > > > passes, enter this block
> > > > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > > > i_mmap_unlock_write(mapping);
> > > > return dax_load_hole(mapping, page, vmf);
> > > > 
> > > > This causes us to up_write() a semaphore that we weren't holding.
> > > > 
> > > > The up_write() on a semaphore we didn't down_write() happens twice in
> > > > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > > > 
> > > > Signed-off-by: Ross Zwisler 
> > > > Reported-by: Dave Chinner 
> > > > ---
> > > >  fs/dax.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 7ae6df7..df1b0ac 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct 
> > > > vm_fault *vmf,
> > > > if (error)
> > > > goto unlock;
> > > > } else {
> > > > -   i_mmap_unlock_write(mapping);
> > > > +   if (!page)
> > > > +   i_mmap_unlock_write(mapping);
> > > > return dax_load_hole(mapping, page, vmf);
> > > > }
> > > > }
> > > 
> > > I can't review this properly because I can't work out how this
> > > locking is supposed to work.  Captain, we have a Charlie Foxtrot
> > > situation here:
> > > 
> > >   page = find_get_page(mapping, vmf->pgoff)
> > >   if (page) {
> > >   
> > >   } else {
> > >   i_mmap_lock_write(mapping);
> > >   }
> > > 
> > > So if there's no page in the page cache, we lock the i_mmap_lock.
> > > The we have the case the above patch fixes. Then later:
> > > 
> > >   if (vmf->cow_page) {
> > >   .
> > >   if (!page) {
> > >   /* can fall through */
> > >   }
> > >   return VM_FAULT_LOCKED;
> > >   }
> > > 
> > > Which means __dax_fault() can also return here with the
> > > i_mmap_lock_write() held. There's no documentation to indicate why
> > > this is valid, and only by looking about 4 function calls higher up
> > > the stack can I see that there's some attempt to handle this
> > > *specific return condition* (in do_cow_fault()). That also is
> > > lacking in documentation explaining the circumstances where we might
> > > have the i_mmap_lock_write() held and have to release it. (Not to
> > > mention the beautiful copy-n-waste of the unlock code, either.)
> > > 
> > > The above code in __dax_fault() is then followed by this gem:
> > > 
> > >   /* Check we didn't race with a read fault installing a new page */
> > > if (!page && major)
> > > page = find_lock_page(mapping, vmf->pgoff);
> > > 
> > >   if (page) {
> > >   /* mapping invalidation  */
> > >   }
> > >   .
> > > 
> > >   if (!page)
> > >   i_mmap_unlock_write(mapping);
> > > 
> > > Which means that if we had a race with another read fault, we'll
> > > remove the page from the page cache, insert the new direct mapped
> > > pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> > > 
> > > Is this supposed to work this way? Or is it another bug?
> > > 
> > > Another difficult question this change of locking raised that I
> > > can't answer: is it valid to call into the filesystem via getblock()
> > > or complete_unwritten() while holding the i_mmap_rwsem? This puts
> > > filesystem transactions and locks inside the scope of i_mmap_rwsem,
> > > which may have impact on the fact that we already have an inode lock
> > > order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> > > other paths, too).
> > > 
> > > So, please document the locking model, explain the corner cases and
> > > the intricacies like why *unbalanced, return value conditional
> > > locking* is necessary, and update the charts of lock order
> > > dependencies in places like mm/filemap.c, and then we might have
> > > some idea of how much of a train-wreck this actually is
> > 
> > Yep, I saw these too, but didn't yet know how to deal with them.  We have at
> > similar issues with __dax_pmd_fault():
> > 
> > i_mmap_lock_write(mapping);
> > length = get_block(inode, block, &bh, write);
> > if (length)
> > return VM_FAULT_SIGBUS;
> > 
> > Whoops

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-25 Thread Dave Chinner
On Fri, Sep 25, 2015 at 12:23:34PM -0600, Ross Zwisler wrote:
> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> <>
> > We've already got block allocation serialisation at the filesystem
> > level, and the issue is the unserialised block zeroing being done by
> > the dax code. That can be fixed by moving the zeroing into the
> > filesystem code when it runs "complete_unwritten" and checks whether
> > the mapping has already been marked as written or not...
> > 
> > I've recently pointed out in a different thread that this is the
> > solution to whatever that problem was (can't recall which
> thread/problem it was now :/ ) and it's the same solution here. We
> > already have the serialisation we need, we just need to move the
> > block zeroing operation into the appropriate places to make it work
> > correctly.
> 
> I think perhaps this is the thread that you're remembering:
> https://lkml.org/lkml/2015/8/11/731

Ah, yes, the read vs write fault race condition, whereas this change
was to address write vs write fault races. And neither of which
address the fault vs truncate problem properly, either, which is
what the XFS_MMAP_LOCKING addresses.

So, yeah, pushing the block zeroing down to the filesystem gets rid
of multiple problems that concurrent page faults have with
initialisation without adding any more serialisation or overhead.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-25 Thread Ross Zwisler
On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
<>
> We've already got block allocation serialisation at the filesystem
> level, and the issue is the unserialised block zeroing being done by
> the dax code. That can be fixed by moving the zeroing into the
> filesystem code when it runs "complete_unwritten" and checks whether
> the mapping has already been marked as written or not...
> 
> I've recently pointed out in a different thread that this is the
> solution to whatever that problem was (can't recall which
> thread/problem it was now :/ ) and it's the same solution here. We
> already have the serialisation we need, we just need to move the
> block zeroing operation into the appropriate places to make it work
> correctly.

I think perhaps this is the thread that you're remembering:
https://lkml.org/lkml/2015/8/11/731


Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-24 Thread Dave Chinner
On Thu, Sep 24, 2015 at 09:50:29AM -0600, Ross Zwisler wrote:
> On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > > that was causing us to deadlock:
> > > 
> > > 1) enter __dax_fault()
> > > 2) page = find_get_page() gives us a page, so skip
> > >   i_mmap_lock_write(mapping)
> > > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > >   passes, enter this block
> > > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > >   i_mmap_unlock_write(mapping);
> > >   return dax_load_hole(mapping, page, vmf);
> > > 
> > > This causes us to up_write() a semaphore that we weren't holding.
> > > 
> > > The up_write() on a semaphore we didn't down_write() happens twice in
> > > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > > 
> > > Signed-off-by: Ross Zwisler 
> > > Reported-by: Dave Chinner 
> > > ---
> > >  fs/dax.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 7ae6df7..df1b0ac 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > >   if (error)
> > >   goto unlock;
> > >   } else {
> > > - i_mmap_unlock_write(mapping);
> > > + if (!page)
> > > + i_mmap_unlock_write(mapping);
> > >   return dax_load_hole(mapping, page, vmf);
> > >   }
> > >   }
> > 
> > I can't review this properly because I can't work out how this
> > locking is supposed to work.  Captain, we have a Charlie Foxtrot
> > situation here:
> > 
> > page = find_get_page(mapping, vmf->pgoff)
> > if (page) {
> > 
> > } else {
> > i_mmap_lock_write(mapping);
> > }
> > 
> > So if there's no page in the page cache, we lock the i_mmap_lock.
> > Then we have the case the above patch fixes. Then later:
> > 
> > if (vmf->cow_page) {
> > .
> > if (!page) {
> > /* can fall through */
> > }
> > return VM_FAULT_LOCKED;
> > }
> > 
> > Which means __dax_fault() can also return here with the
> > i_mmap_lock_write() held. There's no documentation to indicate why
> > this is valid, and only by looking about 4 function calls higher up
> > the stack can I see that there's some attempt to handle this
> > *specific return condition* (in do_cow_fault()). That also is
> > lacking in documentation explaining the circumstances where we might
> > have the i_mmap_lock_write() held and have to release it. (Not to
> > mention the beautiful copy-n-waste of the unlock code, either.)
> > 
> > The above code in __dax_fault() is then followed by this gem:
> > 
> > /* Check we didn't race with a read fault installing a new page */
> > if (!page && major)
> > page = find_lock_page(mapping, vmf->pgoff);
> > 
> > if (page) {
> > /* mapping invalidation  */
> > }
> > .
> > 
> > if (!page)
> > i_mmap_unlock_write(mapping);
> > 
> > Which means that if we had a race with another read fault, we'll
> > remove the page from the page cache, insert the new direct mapped
> > pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> > 
> > Is this supposed to work this way? Or is it another bug?
> > 
> > Another difficult question this change of locking raised that I
> > can't answer: is it valid to call into the filesystem via getblock()
> > or complete_unwritten() while holding the i_mmap_rwsem? This puts
> > filesystem transactions and locks inside the scope of i_mmap_rwsem,
> > which may have impact on the fact that we already have an inode lock
> > order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> > other paths, too).
> > 
> > So, please document the locking model, explain the corner cases and
> > the intricacies like why *unbalanced, return value conditional
> > locking* is necessary, and update the charts of lock order
> > dependencies in places like mm/filemap.c, and then we might have
> > some idea of how much of a train-wreck this actually is
> 
> Yep, I saw these too, but didn't yet know how to deal with them.  We have
> similar issues with __dax_pmd_fault():
> 
>   i_mmap_lock_write(mapping);
>   length = get_block(inode, block, &bh, write);
>   if (length)
>   return VM_FAULT_SIGBUS;
> 
> Whoops, we just took the mmap semaphore, and then we have a return that
> doesn't release it.  A quick test confirms that this will deadlock the next
> fault that tries to take the mmap semaphore.

Ugh. So there's more than one broken commit and code path we are
talking about here.

Oh, even the "fallback" path is broken there -

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-24 Thread Ross Zwisler
On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > that was causing us to deadlock:
> > 
> > 1) enter __dax_fault()
> > 2) page = find_get_page() gives us a page, so skip
> > i_mmap_lock_write(mapping)
> > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > passes, enter this block
> > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > i_mmap_unlock_write(mapping);
> > return dax_load_hole(mapping, page, vmf);
> > 
> > This causes us to up_write() a semaphore that we weren't holding.
> > 
> > The up_write() on a semaphore we didn't down_write() happens twice in
> > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > 
> > Signed-off-by: Ross Zwisler 
> > Reported-by: Dave Chinner 
> > ---
> >  fs/dax.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7ae6df7..df1b0ac 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > if (error)
> > goto unlock;
> > } else {
> > -   i_mmap_unlock_write(mapping);
> > +   if (!page)
> > +   i_mmap_unlock_write(mapping);
> > return dax_load_hole(mapping, page, vmf);
> > }
> > }
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
>   page = find_get_page(mapping, vmf->pgoff)
>   if (page) {
>   
>   } else {
>   i_mmap_lock_write(mapping);
>   }
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> Then we have the case the above patch fixes. Then later:
> 
>   if (vmf->cow_page) {
>   .
>   if (!page) {
>   /* can fall through */
>   }
>   return VM_FAULT_LOCKED;
>   }
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
>   /* Check we didn't race with a read fault installing a new page */
> if (!page && major)
> page = find_lock_page(mapping, vmf->pgoff);
> 
>   if (page) {
>   /* mapping invalidation  */
>   }
>   .
> 
>   if (!page)
>   i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is

Yep, I saw these too, but didn't yet know how to deal with them.  We have
similar issues with __dax_pmd_fault():

i_mmap_lock_write(mapping);
length = get_block(inode, block, &bh, write);
if (length)
return VM_FAULT_SIGBUS;

Whoops, we just took the mmap semaphore, and then we have a return that
doesn't release it.  A quick test confirms that this will deadlock the next
fault that tries to take the mmap semaphore.

I agree that we need to give the fault handling code some attention when it
comes to locking, and that we need to have better documentation.  I'll work on
this when I get some time.

In the meantime I still think it's worthwhile to take the short term fix for
the obvious generic/075 deadlock, yea?

- Ross

Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-24 Thread Boaz Harrosh
On 09/24/2015 05:52 AM, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
>> Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
>> that was causing us to deadlock:
>>
>> 1) enter __dax_fault()
>> 2) page = find_get_page() gives us a page, so skip
>>  i_mmap_lock_write(mapping)
>> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
>>  passes, enter this block
>> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
>>  i_mmap_unlock_write(mapping);
>>  return dax_load_hole(mapping, page, vmf);
>>
>> This causes us to up_write() a semaphore that we weren't holding.
>>
>> The up_write() on a semaphore we didn't down_write() happens twice in
>> a row, and then the next time we try and i_mmap_lock_write(), we hang.
>>
>> Signed-off-by: Ross Zwisler 
>> Reported-by: Dave Chinner 
>> ---
>>  fs/dax.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 7ae6df7..df1b0ac 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>>  if (error)
>>  goto unlock;
>>  } else {
>> -i_mmap_unlock_write(mapping);
>> +if (!page)
>> +i_mmap_unlock_write(mapping);
>>  return dax_load_hole(mapping, page, vmf);
>>  }
>>  }
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
>   page = find_get_page(mapping, vmf->pgoff)
>   if (page) {
>   
>   } else {
>   i_mmap_lock_write(mapping);
>   }
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> Then we have the case the above patch fixes. Then later:
> 
>   if (vmf->cow_page) {
>   .
>   if (!page) {
>   /* can fall through */
>   }
>   return VM_FAULT_LOCKED;
>   }
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
>   /* Check we didn't race with a read fault installing a new page */
> if (!page && major)
> page = find_lock_page(mapping, vmf->pgoff);
> 
>   if (page) {
>   /* mapping invalidation  */
>   }
>   .
> 
>   if (!page)
>   i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is
> 

Hi hi

I hate this VM_FAULT_LOCKED + !page which means i_mmap_lock. I still
think it solves nothing and that we've done a really really bad job.

If we *easily* involve the FS in the locking here (Which btw I think XFS
already does), then this all i_mmap_lock can be avoided.

Please remind me again what race it is supposed to avoid? I get confused.

> Cheers,
> Dave.
> 

Thanks
Boaz



Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-23 Thread Dave Chinner
On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> that was causing us to deadlock:
> 
> 1) enter __dax_fault()
> 2) page = find_get_page() gives us a page, so skip
>   i_mmap_lock_write(mapping)
> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
>   passes, enter this block
> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
>   i_mmap_unlock_write(mapping);
>   return dax_load_hole(mapping, page, vmf);
> 
> This causes us to up_write() a semaphore that we weren't holding.
> 
> The up_write() on a semaphore we didn't down_write() happens twice in
> a row, and then the next time we try and i_mmap_lock_write(), we hang.
> 
> Signed-off-by: Ross Zwisler 
> Reported-by: Dave Chinner 
> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 7ae6df7..df1b0ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>   if (error)
>   goto unlock;
>   } else {
> - i_mmap_unlock_write(mapping);
> + if (!page)
> + i_mmap_unlock_write(mapping);
>   return dax_load_hole(mapping, page, vmf);
>   }
>   }

I can't review this properly because I can't work out how this
locking is supposed to work.  Captain, we have a Charlie Foxtrot
situation here:

page = find_get_page(mapping, vmf->pgoff)
if (page) {

} else {
i_mmap_lock_write(mapping);
}

So if there's no page in the page cache, we lock the i_mmap_lock.
Then we have the case the above patch fixes. Then later:

if (vmf->cow_page) {
.
if (!page) {
/* can fall through */
}
return VM_FAULT_LOCKED;
}

Which means __dax_fault() can also return here with the
i_mmap_lock_write() held. There's no documentation to indicate why
this is valid, and only by looking about 4 function calls higher up
the stack can I see that there's some attempt to handle this
*specific return condition* (in do_cow_fault()). That also is
lacking in documentation explaining the circumstances where we might
have the i_mmap_lock_write() held and have to release it. (Not to
mention the beautiful copy-n-waste of the unlock code, either.)

The above code in __dax_fault() is then followed by this gem:

/* Check we didn't race with a read fault installing a new page */
if (!page && major)
page = find_lock_page(mapping, vmf->pgoff);

if (page) {
/* mapping invalidation  */
}
.

if (!page)
i_mmap_unlock_write(mapping);

Which means that if we had a race with another read fault, we'll
remove the page from the page cache, insert the new direct mapped
pfn into the mapping, and *then fail to unlock the i_mmap lock*.

Is this supposed to work this way? Or is it another bug?

Another difficult question this change of locking raised that I
can't answer: is it valid to call into the filesystem via getblock()
or complete_unwritten() while holding the i_mmap_rwsem? This puts
filesystem transactions and locks inside the scope of i_mmap_rwsem,
which may have impact on the fact that we already have an inode lock
order dependency w.r.t. i_mmap_rwsem through truncate (and probably
other paths, too).

So, please document the locking model, explain the corner cases and
the intricacies like why *unbalanced, return value conditional
locking* is necessary, and update the charts of lock order
dependencies in places like mm/filemap.c, and then we might have
some idea of how much of a train-wreck this actually is

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH] dax: fix deadlock in __dax_fault

2015-09-23 Thread Ross Zwisler
Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
that was causing us to deadlock:

1) enter __dax_fault()
2) page = find_get_page() gives us a page, so skip
i_mmap_lock_write(mapping)
3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
passes, enter this block
4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
i_mmap_unlock_write(mapping);
return dax_load_hole(mapping, page, vmf);

This causes us to up_write() a semaphore that we weren't holding.

The up_write() on a semaphore we didn't down_write() happens twice in
a row, and then the next time we try and i_mmap_lock_write(), we hang.

Signed-off-by: Ross Zwisler 
Reported-by: Dave Chinner 
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7ae6df7..df1b0ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (error)
goto unlock;
} else {
-   i_mmap_unlock_write(mapping);
+   if (!page)
+   i_mmap_unlock_write(mapping);
return dax_load_hole(mapping, page, vmf);
}
}
-- 
2.1.0
