Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-19 Thread Dave Chinner
On Thu, Mar 18, 2021 at 12:20:35PM -0700, Dan Williams wrote:
> On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner  wrote:
> >
> > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > > Jason wondered why the get_user_pages_fast() path takes references on a
> > > @pgmap object. The rationale was to protect against accessing a 'struct
> > > page' that might be in the process of being removed by the driver, but
> > > he rightly points out that should be solved the same way all gup-fast
> > > synchronization is solved which is invalidate the mapping and let the
> > > gup slow path do @pgmap synchronization [1].
> > >
> > > To achieve that it means that new user mappings need to stop being
> > > created and all existing user mappings need to be invalidated.
> > >
> > > For device-dax this is already the case as kill_dax() prevents future
> > > faults from installing a pte, and the single device-dax inode
> > > address_space can be trivially unmapped.
> > >
> > > The situation is different for filesystem-dax where device pages could
> > > be mapped by any number of inode address_space instances. An initial
> > > thought was to treat the device removal event like a drop_pagecache_sb()
> > > event that walks superblocks and unmaps all inodes. However, Dave points
> > > out that it is not just the filesystem user-mappings that need to react
> > > to global DAX page-unmap events, it is also filesystem metadata
> > > (proposed DAX metadata access), and other drivers (upstream
> > > DM-writecache) that need to react to this event [2].
> > >
> > > The only kernel facility that is meant to globally broadcast the loss of
> > > a page (via corruption or surprise remove) is memory_failure(). The
> > > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > > However, the events that would trigger the need to call memory_failure()
> > > over a full PMEM device should be rare.
> >
> > This is a highly suboptimal design. Filesystems only need a single
> > callout to trigger a shutdown that unmaps every active mapping in
> > the filesystem - we do not need a page-by-page error notification
> > which results in 250 million hwpoison callouts per TB of pmem to do
> > this.
> >
> > Indeed, the moment we get the first hwpoison from this patch, we'll
> > map it to the primary XFS superblock and we'd almost certainly
> > consider losing the storage behind that block to be a shut down
> > trigger. During the shutdown, the filesystem should unmap all the
> > active mappings (we already need to add this to shutdown on DAX
> > regardless of this device remove issue) and so we really don't need
> > a page-by-page notification of badness.
> 
> XFS doesn't, but what about device-mapper and other agents? Even if
> the driver had a callback up the stack memory_failure() still needs to
> be able to trigger failures down the stack for CPU consumed poison.

If the device is gone, then they don't need page-by-page
notification, either. Tell them the entire device is gone so they
can do what they need (like pass it up to the filesystem as ranges
of badness!).
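
Roughly, in the terms of Ruan's RFC patchset (the
bd_disk_holder_corrupted_range() hook below is from that series, not
mainline), "the whole device is gone" is a single ranged call covering
the entire device, handed to whatever holds the block device - DM, a
filesystem, whatever - to translate. A minimal sketch:

        /*
         * Sketch only: hook name and types are from Ruan's RFC, not
         * mainline. One call, one range: offset 0, length "everything".
         */
        static void pmem_notify_whole_device_bad(struct block_device *bdev)
        {
                bd_disk_holder_corrupted_range(bdev, 0, -1ULL, NULL);
        }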

> > AFAICT, it's going to take minutes, maybe hours to do the page-by-page
> > iteration to hwpoison every page. It's going to take a few seconds
> > for the filesystem shutdown to run a device wide invalidation.
> >
> > So, yeah, I think this should simply be a single ranged call to the
> > filesystem like:
> >
> > ->memory_failure(dev, 0, -1ULL)
> >
> > to tell the filesystem that the entire backing device has gone away,
> > and leave the filesystem to handle failure entirely at the
> > filesystem level.
> 
> So I went with memory_failure() after our discussion of all the other
> agents in the system that might care about these pfns going offline
> and relying on memory_failure() to route down to each of those. I.e.
> the "reuse the drop_pagecache_sb() model" idea was indeed
> insufficient.

Using drop_pagecache_sb is insufficient because filesystems have
more than just inode indexed caches that pmem failure may affect.
This is not an argument against a "knock everything down at once"
notification model, just that drop_pagecache_sb() is ...
insufficient to do what we need...

> Now I'm trying to reconcile the fact that platform
> poison handling will hit memory_failure() first and may not
> immediately reach the driver, if ever (see the perennially awkward
> firmware-first-mode error handling: ghes_handle_memory_failure()) . So
> even if the ->memory_failu

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-17 Thread Dave Chinner
On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
> 
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
> 
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
> 
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
> 
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare.

This is a highly suboptimal design. Filesystems only need a single
callout to trigger a shutdown that unmaps every active mapping in
the filesystem - we do not need a page-by-page error notification
which results in 250 million hwpoison callouts per TB of pmem to do
this.

Indeed, the moment we get the first hwpoison from this patch, we'll
map it to the primary XFS superblock and we'd almost certainly
consider losing the storage behind that block to be a shut down
trigger. During the shutdown, the filesystem should unmap all the
active mappings (we already need to add this to shutdown on DAX
regardless of this device remove issue) and so we really don't need
a page-by-page notification of badness.

AFAICT, it's going to take minutes, maybe hours to do the page-by-page
iteration to hwpoison every page. It's going to take a few seconds
for the filesystem shutdown to run a device wide invalidation.

So, yeah, I think this should simply be a single ranged call to the
filesystem like:

->memory_failure(dev, 0, -1ULL)

to tell the filesystem that the entire backing device has gone away,
and leave the filesystem to handle failure entirely at the
filesystem level.
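
On the XFS side that single callout could be wired straight into the
existing shutdown path. A rough sketch - the ->memory_failure super op
is hypothetical and the shutdown flag is only illustrative:

        static int xfs_fs_memory_failure(struct super_block *sb, u64 offset,
                                         u64 len)
        {
                struct xfs_mount *mp = XFS_M(sb);

                if (offset == 0 && len == -1ULL) {
                        /*
                         * Whole backing device is gone: shut the filesystem
                         * down, and have the shutdown path unmap all active
                         * DAX mappings in one pass.
                         */
                        xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
                        return 0;
                }

                /* sub-device ranges would be resolved via the rmap btree */
                return -EOPNOTSUPP;
        }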

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
 using dax for metadata, to take a fault when pmem is
> offlined.

 if userspace has directly mapped into the cache, and the cache
storage goes away, the userspace app has to be killed because we
have no idea if the device going away has caused data loss or not.
IOWs, if userspace writes direct to the cache device and it hasn't
been written back to other storage when it gets yanked, we have just
caused data corruption to occur.

At minimum, we now have to tell the filesystem that the dirty data
in the cache is now bad, and direct map applications that map those
dirty ranges need to be killed because their backing store is no
longer valid nor does the backup copy contain the data they last
wrote. Nor is it accessible by direct access, which is going to be
interesting because dynamically changing dax to non-dax access can't
be done without forcibly kicking the inode out of the cache. That
requires all references to the inode to go away. And that means the
event really has to go up to the filesystem.

But I think the biggest piece of the puzzle that you haven't grokked
here is that the dm cache device isn't a linear map - it's made up of
random ranges from the underlying devices. Hence the "remove" of a dm
cache device turns into a huge number of small, sparse corrupt
ranges, not a single linear device remove event.

IOWs, device unplug/remove events are not just simple "pass it on"
events in a stacked storage setup. There can be non-trivial mappings
through the layers, and device disappearance may in fact manifest to
the user as data corruption rather than causing data to be
inaccessible.

Hence "remove" notifications just don't work in the storage stack.
They need to be translated to block ranges going bad (i.e.  media
errors), and reported to higher layers as bad ranges, not as device
removal.

The same goes for DAX devices. The moment they can be placed in
storage stacks in non-trivial configurations and/or used as cache
devices that can be directly accessed over traditional block
devices, we end up with error conditions that can only be mapped as
ranges of blocks that have gone bad.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner  wrote:
> [..]
> > We do not need a DAX specific mechanism to tell us "DAX device
> > gone", we need a generic block device interface that tells us "range
> > of block device is gone".
> 
> This is the crux of the disagreement. The block_device is going away
> *and* the dax_device is going away.

No, that is not the disagreement I have with what you are saying.
You still haven't understood that it's even more basic and generic
than devices going away. In its simplest form, all the filesystem
wants is to be notified when *unrecoverable media errors*
occur in the persistent storage that underlies the filesystem.

The filesystem does not care what that media is built from - PMEM,
flash, corroded spinning disks, MRAM, or any other persistent media
you can think off. It just doesn't matter.

What we care about is that the contents of a *specific LBA range* no
longer contain *valid data*. IOWs, the data in that range of the
block device has been lost, cannot be retrieved and/or cannot be
written to any more.

PMEM taking an MCE because ECC tripped is a media error because data
is lost and inaccessible until recovery actions are taken.

MD RAID failing a scrub is a media error and data is lost and
unrecoverable at that layer.

A device disappearing is a media error because the storage media is
now permanently inaccessible to the higher layers.

This "media error" categorisation is a fundamental property of
persistent storage and, as such, is a property of the block devices
used to access said persistent storage.

That's the disagreement here - that you and Christoph are saying
->corrupted_range is not a block device property because only a
pmem/DAX device currently generates it.

You both seem to be NACKing a generic interface because it's only
implemented for the first subsystem that needs it. AFAICT, you
either don't understand or are completely ignoring the architectural
need for it to be provided across the rest of the storage stack that
*block device based filesystems depend on*.

Sure, there might be dax device based filesystems around the corner.
They just require a different pmem device ->corrupted_range callout
to implement the notification - one that directs to the dax device
rather than the block device. That's simple and trivial to
implement, but such functionality for DAX devices does not replace
the need for the same generic functionality to be provided across a
*range of different block devices* as required by *block device
based filesystems*.

And that's fundamentally the problem. XFS is block device based, not
DAX device based. We require errors to be reported through block
device mechanisms. fs-dax does not change this - it is based on pmem
being presented primarily as a block device to the block device
based filesystems and only secondarily as a dax device. Hence if it
can be trivially implemented as a block device interface, that's
where it should go, because then all the other block devices that
the filesystem runs on can provide the same functionality for similar
media error events.
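
i.e. any block driver - MD after a failed scrub, say - could report a
bad LBA range through exactly the same interface. A sketch, where the
->corrupted_range sb_op and bd_disk_holder_corrupted_range() come from
Ruan's RFC (not mainline) and the MD hook point and types are assumed:

        static int md_report_corrupted_range(struct block_device *bdev,
                                             loff_t offset, size_t len)
        {
                struct super_block *sb = get_super(bdev);
                int rc = -ENODEV;

                if (sb) {
                        if (sb->s_op->corrupted_range)
                                rc = sb->s_op->corrupted_range(sb, bdev,
                                                offset, len, NULL);
                        drop_super(sb);
                } else {
                        /* nothing mounted: let the holder (DM, etc.) translate */
                        rc = bd_disk_holder_corrupted_range(bdev, offset,
                                                len, NULL);
                }
                return rc;
        }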

> The dax_device removal implies one
> set of actions (direct accessed pfns invalid) the block device removal
> implies another (block layer sector access offline).

There you go again, saying DAX requires an action, while the block
device notification is a -state change- (i.e. goes offline).

This is exactly what I said was wrong in my last email.

> corrupted_range
> is blurring the notification for 2 different failure domains. Look at
> the nascent idea to mount a filesystem on dax sans a block device.
> Look at the existing plumbing for DM to map dax_operations through a
> device stack.

Ummm, it just maps the direct_access call to the underlying device
and calls its ->direct_access method. All it's doing is LBA
mapping. That's all it needs to do for ->corrupted_range, too.
I have no clue why you think this is a problem for error
notification...

> Look at the pushback Ruan got for adding a new
> block_device operation for corrupted_range().

One person said "no". That's hardly pushback. Especially as I think
Christoph's objection about this being dax specific functionality
is simply wrong, as per above.

> > This is why we need to communicate what error occurred, not what
> > action a device driver thinks needs to be taken.
> 
> The driver is only an event producer in this model, whatever the
> consumer does at the other end is not its concern. There may be a
> generic consumer and a filesystem specific consumer.



That's why these are all ops functions that can provide multiple
implementations to different device types. So that when we get a new
use case, the ops function structure can be replaced with one that
directs the notification

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 04:32:36PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner  wrote:
> > Now we have the filesystem people providing a mechanism for the pmem
> > devices to tell the filesystems about physical device failures so
> > they can handle such failures correctly themselves. Having the
> > device go away unexpectedly from underneath a mounted and active
> > filesystem is a *device failure*, not an "unplug event".
> 
> It's the same difference to the physical page, all mappings to that
> page need to be torn down. I'm happy to call an fs callback and let
> each filesystem do what it wants with a "every pfn in this dax device
> needs to be unmapped".

You keep talking like this is something specific to a DAX device.
It isn't - the filesystem needs to take specific actions if any type
of block device reports that it has a corrupted range, not just DAX.
A DAX device simply adds "and invalidate direct mappings" to the
list of stuff that needs to be done.

And as far as a filesystem is concerned, there is no difference
between "this 4kB range is bad" and "the range of this entire device
is bad". We have to do the same things in both situations.

> I'm looking at the ->corrupted_range() patches trying to map it to
> this use case and I don't see how, for example a realtime-xfs over DM
> over multiple PMEM gets the notification to the right place.
> bd_corrupted_range() uses get_super() which get the wrong answer for
> both realtime-xfs and DM.

I'm not sure I follow your logic. What is generating the wrong
answer?

We already have infrastructure for the block device to look up the
superblock mounted on top of it, and DM already uses that for things
like "dmsetup suspend" to freeze the filesystem before it does
something.  This "superblock lookup" only occurs for the top level
DM device, not for the component pmem devices that make up the DM
device.


IOWs, if there's a DM device that maps multiple pmem devices, then
it should be stacking the bd_corrupted_range() callbacks to map the
physical device range to the range in the higher level DM device
that it belongs to. This mapping of ranges is what DM exists to do -
the filesystem has no clue about what devices make up a DM device,
so the DM device *must* translate ranges on its component devices
into the corresponding ranges in the LBA space it exposes to
the filesystem.
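
Conceptually the DM side is nothing more than this sketch - the two
helpers marked "invented" don't exist, a real implementation would walk
the active dm_table, but the shape is: component device range in,
range in the DM device's LBA space out, then pass it up the stack:

        static int dm_corrupted_range(struct gendisk *disk,
                                      struct block_device *src,
                                      loff_t offset, size_t len, void *data)
        {
                struct mapped_device *md = disk->private_data;
                sector_t src_sector = offset >> SECTOR_SHIFT;
                sector_t dm_sector;
                int rc;

                /* invented helper: component device sector -> our LBA space */
                rc = dm_remap_component_sector(md, src, src_sector, &dm_sector);
                if (rc)
                        return rc;      /* range not mapped by this DM device */

                /* invented helper: report the translated range upwards */
                return dm_report_corrupted_range(md,
                                (loff_t)dm_sector << SECTOR_SHIFT, len, data);
        }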

> I'd flip that arrangement around and have the FS tell the block device
> "if something happens to you, here is the super_block to notify".

We already have a mechanism for this that the block device calls:
get_active_super(bdev). There can be only one superblock per block
device - the superblock has exclusive ownership of the block device
while the filesystem is mounted.

get_active_super() returns the superblock that sits on top of the
bdev with an active reference, allowing the caller to safely access
and operate on the superblock without having to worry about the
superblock going away in the middle of whatever operation the block
device needs to perform.

If this isn't working, then existing storage stack functionality
doesn't work as it should and this needs fixing independently of
the PMEM/DAX stuff we are talking about here.
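
i.e. the lookup pattern already exists and is the same one freeze_bdev()
uses. Sketch only - get_active_super()/deactivate_super() are real, the
->corrupted_range call is from Ruan's RFC:

        static int bdev_report_corrupted_range(struct block_device *bdev,
                                               loff_t offset, size_t len,
                                               void *data)
        {
                struct super_block *sb = get_active_super(bdev);
                int rc = -ENODEV;

                if (!sb)
                        return rc;      /* nothing mounted on this bdev */

                if (sb->s_op->corrupted_range)
                        rc = sb->s_op->corrupted_range(sb, bdev, offset, len,
                                                       data);

                deactivate_super(sb);   /* drop the active reference */
                return rc;
        }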

> So
> to me this looks like a fs_dax_register_super() helper that plumbs the
> superblock through an arbitrary stack of block devices to the leaf
> block-device that might want to send a notification up when a global
> unmap operation needs to be performed.

No, this is just wrong. The filesystem has no clue what block device
is at the leaf level of a block device stack, nor what LBA block
range represents that device within the address space the stacked
block devices present to the filesystem.

> > Please listen when we say "that is
> > not sufficient" because we don't want to be backed into a corner
> > that we have to fix ourselves again before we can enable some basic
> > filesystem functionality that we should have been able to support on
> > DAX from the start...
> 
> That's some revisionist interpretation of how the discovery of the
> reflink+dax+memory-error-handling collision went down.
> 
> The whole point of this discussion is to determine if
> ->corrupted_range() is suitable for this notification, and looking at
> the code as is, it isn't. Perhaps you have a different implementation
> of ->corrupted_range() in mind that allows this to be plumbed
> correctly?

So rather than try to make the generic ->corrupted_range
infrastructure be able to report "DAX range is invalid" (which is
the very definition of a corrupted block device range!), you want
to introduce a DAX specific notification to tell us that a range in
the block device contains invalid/corrupt data?


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that they
> > > > map to and if they map to a pmem page we go poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
> 
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
> 
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
> 
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

Dan, you made this mistake with the hwpoisoning code that we're
trying to fix here. Hard coding a 1:1 physical address to
inode/offset into the DAX mapping was a bad mistake. It's also one
that should never have occurred because it's *obviously wrong* to
filesystem developers and has been for a long time.

Now we have the filesystem people providing a mechanism for the pmem
devices to tell the filesystems about physical device failures so
they can handle such failures correctly themselves. Having the
device go away unexpectedly from underneath a mounted and active
filesystem is a *device failure*, not an "unplug event".

The mistake you made was not understanding how filesystems work,
nor actually asking filesystem developers what they actually needed.
You're doing the same thing here - you're telling us what you think
the solution filesystems need is. Please listen when we say "that is
not sufficient" because we don't want to be backed into a corner
that we have to fix ourselves again before we can enable some basic
filesystem functionality that we should have been able to support on
DAX from the start...

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see why you seem to be telling us that invalidation is an
> > either/or choice. There are more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks
> 
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

No.

drop_pagecache_sb() is not a relevant model for telling a filesystem
that the block device underneath has gone away, nor for a device to
ensure that access protections that *are managed by the filesystem*
are enforced/revoked sanely.

drop_pagecache_sb() is a brute-force model for invalidating user
data mappings that the filesystem performs in response to such a
notification. It only needs this brute-force approach if it has no
other way to find active DAX mappings across the range of the device
that has gone away.

But this model doesn't work for direct mapped metadata, journals or
any other internal direct filesystem mappings that aren't referenced
by inodes that the filesystem might be using. The filesystem still
needs to invali

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-28 Thread Dave Chinner
On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > it points to, check if it points to the PMEM that is being removed,
> > grab the page it points to, map that to the relevant struct page,
> > run collect_procs() on that page, then kill the user processes that
> > map that page.
> >
> > So why can't we walk the ptes, check the physical pages that they
> > map to and if they map to a pmem page we go poison that
> > page and that kills any user process that maps it.
> >
> > i.e. I can't see how unexpected pmem device unplug is any different
> > to an MCE delivering a hwpoison event to a DAX mapped page.
> 
> I guess the tradeoff is walking a long list of inodes vs walking a
> large array of pages.

Not really. You're assuming all a filesystem has to do is invalidate
everything if a device goes away, and that's not true. Finding if an
inode has a mapping that spans a specific device in a multi-device
filesystem can be a lot more complex than that. Just walking inodes
is easy - determining which inodes need invalidation is the hard
part.

That's where ->corrupt_range() comes in - the filesystem is already
set up to do reverse mapping from physical range to inode(s)
offsets...
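
To make that concrete, the XFS side of a ->corrupt_range() callout is
essentially an rmap query over the bad physical extent.
xfs_rmap_query_range() and struct xfs_rmap_irec are the real rmap btree
interfaces; the callback body and the surrounding plumbing here are
illustrative only:

        static int
        xfs_corrupt_range_rmap_helper(struct xfs_btree_cur *cur,
                                      struct xfs_rmap_irec *rec, void *priv)
        {
                if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
                        /* metadata extent: shut down or schedule repair */
                        return -EFSCORRUPTED;
                }

                /*
                 * rec->rm_owner is an inode number and rec->rm_offset the
                 * file offset: look the inode up, invalidate that range and
                 * kill anything that has it direct mapped.
                 */
                return 0;
        }

        /* ... passed to xfs_rmap_query_range(cur, &low, &high,
         *     xfs_corrupt_range_rmap_helper, &ctx) for each AG that
         *     overlaps the bad physical range. */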

> There's likely always more pages than inodes, but perhaps it's more
> efficient to walk the 'struct page' array than sb->s_inodes?

I really don't see why you seem to be telling us that invalidation is an
either/or choice. There are more ways to convert physical block
address -> inode file offset and mapping index than brute force
inode cache walks...

> > IOWs, what needs to happen at this point is very filesystem
> > specific. Assuming that "device unplug == filesystem dead" is not
> > correct, nor is specifying a generic action that assumes the
> > filesystem is dead because a device it is using went away.
> 
> Ok, I think I set this discussion in the wrong direction implying any
> mapping of this action to a "filesystem dead" event. It's just a "zap
> all ptes" event and upper layers recover from there.

Yes, that's exactly what ->corrupt_range() is intended for. It
allows the filesystem to lock out access to the bad range
and then recover the data. Or metadata, if that's where the bad
range lands. If that recovery fails, it can then report a data
loss/filesystem shutdown event to userspace and kill user procs that
span the bad range...

FWIW, is this notification going to occur before or after the device
has been physically unplugged? i.e. what do we do about the
time-of-unplug-to-time-of-invalidation window where userspace can
still attempt to access the missing pmem through the
not-yet-invalidated ptes? It may not be likely that people just yank
pmem nvdimms out of machines, but with NVMe persistent memory
spaces, there's every chance that someone pulls the wrong device...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-27 Thread Dave Chinner
On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  wrote:
> > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner  wrote:
> > > > > My immediate concern is the issue Jason recently highlighted [1] with
> > > > > respect to invalidating all dax mappings when / if the device is
> > > > > ripped out from underneath the fs. I don't think that will collide
> > > > > with Ruan's implementation, but it does need new communication from
> > > > > driver to fs about removal events.
> > > > >
> > > > > [1]: 
> > > > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> > > >
> > > > Oh, yay.
> > > >
> > > > The XFS shutdown code is centred around preventing new IO from being
> > > > issued - we don't actually do anything about DAX mappings because,
> > > > well, I don't think anyone on the filesystem side thought they had
> > > > to do anything special if pmem went away from under it.
> > > >
> > > > My understanding -was- that the pmem removal invalidates
> > > > all the ptes currently mapped into CPU page tables that point at
> > > > the dax device across the system. The vmas that manage these
> > > > mappings are not really something the filesystem manages,
> > > > but a function of the mm subsystem. What the filesystem cares about
> > > > is that it gets page faults triggered when a change of state occurs
> > > > so that it can remap the page to it's backing store correctly.
> > > >
> > > > IOWs, all the mm subsystem needs to do when pmem goes away is clear the
> > > > CPU ptes, because then when userspace tries to access the
> > > > mapped DAX pages we get a new page fault. In processing the fault, the
> > > > filesystem will try to get direct access to the pmem from the block
> > > > device. This will get an ENODEV error from the block device
> > > > because the backing store (pmem) has been unplugged and is no longer
> > > > there...
> > > >
> > > > AFAICT, as long as pmem removal invalidates all the active ptes that
> > > > point at the pmem being removed, the filesystem doesn't need to
> > > > care about device removal at all, DAX or no DAX...
> > >
> > > How would the pmem removal do that without walking all the active
> > > inodes in the fs at the time of shutdown and call
> > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)?
> >
> > Which then immediately ends up back at the vmas that manage the ptes
> > to unmap them.
> >
> > Isn't finding the vma(s) that map a specific memory range exactly
> > what the rmap code in the mm subsystem is supposed to address?
> 
> rmap can lookup only vmas from a virt address relative to a given
> mm_struct. The driver has neither the list of mm_struct objects nor
> virt addresses to do a lookup. All it knows is that someone might have
> mapped pages through the fsdax interface.

So there's no physical addr to vma translation in the mm subsystem
at all?

That doesn't make sense. We do exactly this for hwpoison for DAX
mappings. While we don't look at ptes, we get a pfn, grab the page
it points to, check if it points to the PMEM that is being removed,
map that to the relevant struct page,
run collect_procs() on that page, then kill the user processes that
map that page.

So why can't we walk the ptes, check the physical pages that they
map to and if they map to a pmem page we go poison that
page and that kills any user process that maps it.

i.e. I can't see how unexpected pmem device unplug is any different
to an MCE delivering a hwpoison event to a DAX mapped page.  Both
indicate a physical address range now contains invalid data and the
filesystem has to take the same action...

IOWs, we could just call ->corrupted_range(0, EOD) here to tell the
filesystem the entire device went away. Then the filesystem can deal
with this however it needs to. However, it would be more efficient
from an invalidation POV to just call it on the pages that have
currently active ptes because once the block device is dead
new page faults on DAX mappings will get a SIGBUS naturally.

> To me this looks like a notifier that fires from memunmap_pages()
> after dev_pagemap_kill() to notify any block_device associated with
> that dev_pagemap() to say that any dax mappings arranged through this
> block_device are now invalid. The reason 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Dave Chinner
On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner  wrote:
> >
> > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  
> > > wrote:
> > > >
> > > > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> > > > > Hi, guys
> > > > >
> > > > > Beside this patchset, I'd like to confirm something about the
> > > > > "EXPERIMENTAL" tag for dax in XFS.
> > > > >
> > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message
> > > > > when we mount a pmem device with dax option, has been existed for a
> > > > > while.  It's a bit annoying when using fsdax feature.  So, my initial
> > > > > intention was to remove this tag.  And I started to find out and solve
> > > > > the problems which prevent it from being removed.
> > > > >
> > > > > As is talked before, there are 3 main problems.  The first one is "dax
> > > > > semantics", which has been resolved.  The rest two are "RMAP for
> > > > > fsdax" and "support dax reflink for filesystem", which I have been
> > > > > working on.
> > > >
> > > > 
> > > >
> > > > > So, what I want to confirm is: does it means that we can remove the
> > > > > "EXPERIMENTAL" tag when the rest two problem are solved?
> > > >
> > > > Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> > > > nothing new pops up, but otherwise the two patchsets you've sent close
> > > > those two big remaining gaps.  Thank you for working on this!
> > > >
> > > > > Or maybe there are other important problems need to be fixed before
> > > > > removing it?  If there are, could you please show me that?
> > > >
> > > > That remains to be seen through QA/validation, but I think that's it.
> > > >
> > > > Granted, I still have to read through the two patchsets...
> > >
> > > I've been meaning to circle back here as well.
> > >
> > > My immediate concern is the issue Jason recently highlighted [1] with
> > > respect to invalidating all dax mappings when / if the device is
> > > ripped out from underneath the fs. I don't think that will collide
> > > with Ruan's implementation, but it does need new communication from
> > > driver to fs about removal events.
> > >
> > > [1]: 
> > > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> >
> > Oh, yay.
> >
> > The XFS shutdown code is centred around preventing new IO from being
> > issued - we don't actually do anything about DAX mappings because,
> > well, I don't think anyone on the filesystem side thought they had
> > to do anything special if pmem went away from under it.
> >
> > My understanding -was- that the pmem removal invalidates
> > all the ptes currently mapped into CPU page tables that point at
> > the dax device across the system. The vmas that manage these
> > mappings are not really something the filesystem manages,
> > but a function of the mm subsystem. What the filesystem cares about
> > is that it gets page faults triggered when a change of state occurs
> > so that it can remap the page to it's backing store correctly.
> >
> > IOWs, all the mm subsystem needs to do when pmem goes away is clear the
> > CPU ptes, because then when userspace tries to access the
> > mapped DAX pages we get a new page fault. In processing the fault, the
> > filesystem will try to get direct access to the pmem from the block
> > device. This will get an ENODEV error from the block device
> > because the backing store (pmem) has been unplugged and is no longer
> > there...
> >
> > AFAICT, as long as pmem removal invalidates all the active ptes that
> > point at the pmem being removed, the filesystem doesn't need to
> > care about device removal at all, DAX or no DAX...
> 
> How would the pmem removal do that without walking all the active
> inodes in the fs at the time of shutdown and call
> unmap_mapping_range(inode->i_mapping, 0, 0, 1)?

Which then immediately ends up back at the vmas that manage the ptes
to unmap them.

Isn't finding the vma(s) that map a specific memory range exactly
what the rmap code in the mm subsystem is supposed to address?
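
For a file-backed (or fsdax, via dax_associate_entry()) page, that walk
looks roughly like this - rmap_walk(), struct rmap_walk_control and
zap_vma_ptes() are real interfaces, while page locking, error handling
and the process-kill step are omitted:

        static bool zap_one_pmem_mapping(struct page *page,
                                         struct vm_area_struct *vma,
                                         unsigned long addr, void *arg)
        {
                /* tear down the pte(s) for this page in this vma */
                zap_vma_ptes(vma, addr, PAGE_SIZE);
                return true;    /* keep walking - other vmas may map it too */
        }

        static void unmap_removed_pmem_page(struct page *page)
        {
                struct rmap_walk_control rwc = {
                        .rmap_one = zap_one_pmem_mapping,
                };

                rmap_walk(page, &rwc);
        }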

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Dave Chinner
On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  wrote:
> >
> > On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> > > Hi, guys
> > >
> > > Beside this patchset, I'd like to confirm something about the
> > > "EXPERIMENTAL" tag for dax in XFS.
> > >
> > > In XFS, the "EXPERIMENTAL" tag, which is reported in waring message
> > > when we mount a pmem device with dax option, has been existed for a
> > > while.  It's a bit annoying when using fsdax feature.  So, my initial
> > > intention was to remove this tag.  And I started to find out and solve
> > > the problems which prevent it from being removed.
> > >
> > > As is talked before, there are 3 main problems.  The first one is "dax
> > > semantics", which has been resolved.  The rest two are "RMAP for
> > > fsdax" and "support dax reflink for filesystem", which I have been
> > > working on.
> >
> > 
> >
> > > So, what I want to confirm is: does it means that we can remove the
> > > "EXPERIMENTAL" tag when the rest two problem are solved?
> >
> > Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> > nothing new pops up, but otherwise the two patchsets you've sent close
> > those two big remaining gaps.  Thank you for working on this!
> >
> > > Or maybe there are other important problems need to be fixed before
> > > removing it?  If there are, could you please show me that?
> >
> > That remains to be seen through QA/validation, but I think that's it.
> >
> > Granted, I still have to read through the two patchsets...
> 
> I've been meaning to circle back here as well.
> 
> My immediate concern is the issue Jason recently highlighted [1] with
> respect to invalidating all dax mappings when / if the device is
> ripped out from underneath the fs. I don't think that will collide
> with Ruan's implementation, but it does need new communication from
> driver to fs about removal events.
> 
> [1]: 
> http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com

Oh, yay.

The XFS shutdown code is centred around preventing new IO from being
issued - we don't actually do anything about DAX mappings because,
well, I don't think anyone on the filesystem side thought they had
to do anything special if pmem went away from under it.

My understanding -was- that the pmem removal invalidates
all the ptes currently mapped into CPU page tables that point at
the dax device across the system. The vmas that manage these
mappings are not really something the filesystem manages,
but a function of the mm subsystem. What the filesystem cares about
is that it gets page faults triggered when a change of state occurs
so that it can remap the page to it's backing store correctly.

IOWs, all the mm subsystem needs to do when pmem goes away is clear the
CPU ptes, because then when userspace tries to access the
mapped DAX pages we get a new page fault. In processing the fault, the
filesystem will try to get direct access to the pmem from the block
device. This will get an ENODEV error from the block device
because the backing store (pmem) has been unplugged and is no longer
there...

AFAICT, as long as pmem removal invalidates all the active ptes that
point at the pmem being removed, the filesystem doesn't need to
care about device removal at all, DAX or no DAX...
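
For concreteness, "clear the CPU ptes" amounts to an
unmap_mapping_range() call per affected address_space, whoever ends up
driving it. A minimal sketch, with inode list locking and reference
counting omitted:

        static void zap_all_dax_user_mappings(struct super_block *sb)
        {
                struct inode *inode;

                list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                        if (!IS_DAX(inode))
                                continue;
                        /* zap every user pte so the next access refaults */
                        unmap_mapping_range(inode->i_mapping, 0, 0, 1);
                }
        }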

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Expense of read_iter

2021-01-19 Thread Dave Chinner
On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> On Thu, 14 Jan 2021, Mikulas wrote:
> For Ext4-dax, the overhead of dax_iomap_rw is significant
> compared to the overhead of struct iov_iter. Although methods
> proposed by Mikulas can eliminate the overhead of iov_iter
> well, they can not be applied in Ext4-dax unless we implement an
> internal "read" method in Ext4-dax.
> 
> For Ext4-dax, there could be two approaches to optimizing:
> 1) implementing the internal "read" method without the complexity
> of iterators and dax_iomap_rw;

Please do not go and re-invent the wheel just for ext4. If there's a
problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
as well, so any improvements to that path benefit all DAX users, not
just ext4.

> 2) optimizing how dax_iomap_rw works.
> Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> the iomap structure and others (e.g., journaling status locks in Ext4),
> we think implementing the internal "read" method would be easier.

Maybe it is, but it's also very selfish. The DAX iomap path was
written to be correct for all users, not necessarily to provide
optimal performance. There will be lots of things that could be done
to optimise it, so rather than creating a special snowflake in ext4
that makes DAX in ext4 much harder to maintain for non-ext4 DAX
developers, please work to improve the common DAX IO path and so
provide the same benefit to all the filesystems that use it.
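
For reference, the shared entry point in question: ext2, ext4 and XFS
all funnel DAX reads through dax_iomap_rw() with their own iomap_ops,
so any optimisation there helps every DAX filesystem. A simplified
sketch with per-fs locking and checks omitted, and the iomap_ops just a
placeholder the filesystem would supply:

        static const struct iomap_ops example_read_iomap_ops; /* fs supplies this */

        static ssize_t example_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
        {
                if (!iov_iter_count(to))
                        return 0;

                /* the common DAX path: ->iomap_begin/->iomap_end do the mapping */
                return dax_iomap_rw(iocb, to, &example_read_iomap_ops);
        }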

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v3 4/9] mm, fsdax: Refactor memory-failure handler for dax mapping

2020-12-16 Thread Dave Chinner
On Tue, Dec 15, 2020 at 08:14:09PM +0800, Shiyang Ruan wrote:
> The current memory_failure_dev_pagemap() can only handle single-mapped
> dax page for fsdax mode.  The dax page could be mapped by multiple files
> and offsets if we let reflink feature & fsdax mode work together.  So,
> we refactor current implementation to support handle memory failure on
> each file and offset.
> 
> Signed-off-by: Shiyang Ruan 
> ---
.
>  static const char *action_name[] = {
> @@ -1147,6 +1148,60 @@ static int try_to_split_thp_page(struct page *page, 
> const char *msg)
>   return 0;
>  }
>  
> +int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t index, 
> int flags)
> +{
> + const bool unmap_success = true;
> + unsigned long pfn, size = 0;
> + struct to_kill *tk;
> + LIST_HEAD(to_kill);
> + int rc = -EBUSY;
> + loff_t start;
> + dax_entry_t cookie;
> +
> + /*
> +  * Prevent the inode from being freed while we are interrogating
> +  * the address_space, typically this would be handled by
> +  * lock_page(), but dax pages do not use the page lock. This
> +  * also prevents changes to the mapping of this pfn until
> +  * poison signaling is complete.
> +  */
> + cookie = dax_lock(mapping, index, &pfn);
> + if (!cookie)
> + goto unlock;

Why do we need to prevent the inode from going away here? This
function gets called by XFS after doing an xfs_iget() call to grab
the inode that owns the block. Hence the inode (and the mapping)
are guaranteed to be referenced and can't go away. Hence for the
filesystem based callers, this whole "dax_lock()" thing can go away.

So, AFAICT, the dax_lock() stuff is only necessary when the
filesystem can't be used to resolve the owner of the physical page that
went bad

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-15 Thread Dave Chinner
On Tue, Dec 15, 2020 at 12:51:02PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 4688bff19c20..e8cfaf860149 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, 
> > struct block_device *bdev,
> >  
> > bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
> > sb = get_super(bdev);
> > -   if (sb && sb->s_op->corrupted_range) {
> > +   if (!sb) {
> > +   rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, 
> > data);
> > +   goto out;
> > +   } else if (sb->s_op->corrupted_range)
> > rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, 
> > data);
> > -   drop_super(sb);
> 
> This is out of scope for this patch(set) but do you think that the scsi
> disk driver should intercept media errors from sense data and call
> ->corrupted_range too?  ISTR Ted muttering that one of his employers had
> a patchset to do more with sense data than the upstream kernel currently
> does...

Most definitely!

That's the whole point of layering corrupt range reporting through
the device layers like this - the corrupted range reporting is not
limited specifically to pmem devices and so generic storage failures
(e.g.  RAID failures, hardware media failures, etc) can be reported
back up to the filesystem and we can take immediate, appropriate
action, including reporting to userspace that they just lost data in
file X at offset Y...

Combine that with the proposed "watch_sb()" syscall for reporting
such errors in a generic manner to interested listeners, and we've
got a fairly solid generic path for reporting data loss events to
userspace for an appropriate user-defined action to be taken...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-15 Thread Dave Chinner
On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > Hi Jane
> > 
> > On 2020/12/15 4:58 AM, Jane Chu wrote:
> > > Hi, Shiyang,
> > > 
> > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > This patchset is a try to resolve the problem of tracking shared page
> > > > for fsdax.
> > > > 
> > > > Change from v1:
> > > >    - Intorduce ->block_lost() for block device
> > > >    - Support mapped device
> > > >    - Add 'not available' warning for realtime device in XFS
> > > >    - Rebased to v5.10-rc1
> > > > 
> > > > This patchset moves owner tracking from dax_assocaite_entry() to pmem
> > > > device, by introducing an interface ->memory_failure() of struct
> > > > pagemap.  The interface is called by memory_failure() in mm, and
> > > > implemented by pmem device.  Then pmem device calls its ->block_lost()
> > > > to find the filesystem which the damaged page located in, and call
> > > > ->storage_lost() to track files or metadata assocaited with this page.
> > > > Finally we are able to try to fix the damaged data in filesystem and do
> > > 
> > > Does that mean clearing poison? if so, would you mind to elaborate
> > > specifically which change does that?
> > 
> > Recovering data for filesystem (or pmem device) has not been done in
> > this patchset...  I just triggered the handler for the files sharing the
> > corrupted page here.
> 
> Thanks! That confirms my understanding.
> 
> With the framework provided by the patchset, how do you envision it to
> ease/simplify poison recovery from the user's perspective?

At the moment, I'd say no change whatsoever. The behaviour is
necessary so that we can kill whatever user application maps
multiply-shared physical blocks if there's a memory error. The
recovery method from that is unchanged. The only advantage may be
that the filesystem (if rmap enabled) can tell you the exact file
and offset into the file where data was corrupted.

However, it can be worse, too: it may also now completely shut down
the filesystem if the filesystem discovers the error is in metadata
rather than user data. That's much more complex to recover from, and
right now will require downtime to take the filesystem offline and
run fsck to correct the error. That may trash whatever the metadata
that can't be recovered points to, so you still have a user data
recovery process to perform after this...

> And how does it help in dealing with page faults upon poisoned
> dax page?

It doesn't. If the page is poisoned, the same behaviour will occur
as does now. This is simply error reporting infrastructure, not
error handling.

Future work might change how we correct the faults found in the
storage, but I think the user visible behaviour is going to be "kill
apps mapping corrupted data" for a long time yet

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-06 Thread Dave Chinner
On Wed, Dec 02, 2020 at 03:12:20PM +0800, Ruan Shiyang wrote:
> Hi Dave,
> 
> On 2020/11/30 6:47 AM, Dave Chinner wrote:
> > On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> > > 
> > > The call trace is like this:
> > >   memory_failure()
> > > pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
> > >  gendisk->fops->block_lost()   => pmem_block_lost() or
> > >   md_blk_block_lost()
> > >   sb->s_ops->storage_lost()=> xfs_fs_storage_lost()
> > >xfs_rmap_query_range()
> > > xfs_storage_lost_helper()
> > >  mf_recover_controller->recover_fn => \
> > >  memory_failure_dev_pagemap_kill_procs()
> > > 
> > > The collect_procs() and kill_procs() are moved into a callback which
> > > is passed from memory_failure() to xfs_storage_lost_helper().  So we
> > > can call it when a file assocaited is found, instead of creating a
> > > file list and iterate it.
> > > 
> > > The fsdax & reflink support for XFS is not contained in this patchset.
> > 
> > This looks promising - the overall architecture is a lot more
> > generic and less dependent on knowing about memory, dax or memory
> > failures. A few comments that I think would further improve
> > understanding the patchset and the implementation:
> 
> Thanks for your kindly comment.  It gives me confidence.
> 
> > 
> > - the order of the patches is inverted. It should start with a
> >single patch introducing the mf_recover_controller structure for
> >callbacks, then introduce pgmap->ops->memory_failure, then
> >->block_lost, then the pmem and md implementations of
> >->block_lost, then ->storage_lost and the XFS implementations of
> >->storage_lost.
> 
> Yes, it will be easier to understand the patchset in this order.
> 
> But I have something unsure: for example, I introduce ->memory_failure()
> firstly, but the implementation of ->memory_failure() needs to call
> ->block_lost() which is supposed to be introduced in the next patch. So, I
> am not sure the code is supposed to be what in the implementation of
> ->memory_failure() in pmem?  To avoid this situation, I committed the
> patches in the inverted order: lowest level first, then its caller, and then
> caller's caller.

Well, there's two things here. The first is the infrastructure, the
second is the drivers that use the infrastructure. You can introduce
a method in one patch, and then the driver that uses it in another.
Or you can introduce a driver skeleton that does nothing until
more infrastructure is added. So...

> 
> I am trying to sort out the order.  How about this:
>  Patch i.
>Introduce ->memory_failure()
>   - just introduce interface, without implementation
>  Patch i++.
>Introduce ->block_lost()
>   - introduce interface and implement ->memory_failure()
>  in pmem, so that it can call ->block_lost()
>  Patch i++.
>(similar with above, skip...)

So this is better, but you don't need to add the pmem driver use of
"->block_lost" in the patch that adds the method. IOWs, something
like:

P1: introduce ->memory_failure API, all the required documentation
and add the call sites in the infrastructure that trigger it

P2: introduce ->corrupted_range to the block device API, all the
required documentation and any generic block infrastructure that
needs to call it.

P3: introduce ->corrupted_range to the superblock ops API, all the
required documentation

P4: add ->corrupted_range() API to the address space ops, all the
required documentation

P5: factor the existing kill procs stuff to be able to be called
via generic_mapping_kill_range()

P6: add dax_mapping_kill_range()

P7: add the pmem driver support for ->memory_failure

P8: add the block device driver support for ->corrupted_range

P9: add filesystem support for sb_ops->corrupted_range.

P10: add filesystem support for aops->corrupted_range.
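
To make the end state concrete, the two filesystem-facing hooks being
staged in would have roughly these shapes (signatures illustrative; the
superblock form matches what the current RFC already uses, the
address_space form is the proposal above, with the range expressed
relative to the file):

        typedef int (*sb_corrupted_range_fn)(struct super_block *sb,
                                             struct block_device *bdev,
                                             loff_t offset, size_t len,
                                             void *data);

        typedef int (*as_corrupted_range_fn)(struct address_space *mapping,
                                             loff_t offset, size_t len,
                                             void *data);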

> >This gets rid of the mf_recover_controller altogether and allows
> >the interface to be used by any sort of block device for any sort
> >of bottom-up reporting of media/device failures.
> 
> Moving the recover function to the address_space ops looks a better idea.
> But I think that the error handler for page cache mapping is finished well
> in memory-failure.  The memory-failure is also reused to handles anonymous
> page.

Yes, anonymous page handling can remain there, we're only concerned
about handling file mapped pages here right now. If we end 

Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-11-29 Thread Dave Chinner
On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the problem of tracking shared page
> for fsdax.
> 
> Change from v1:
>   - Intorduce ->block_lost() for block device
>   - Support mapped device
>   - Add 'not available' warning for realtime device in XFS
>   - Rebased to v5.10-rc1
> 
> This patchset moves owner tracking from dax_assocaite_entry() to pmem
> device, by introducing an interface ->memory_failure() of struct
> pagemap.  The interface is called by memory_failure() in mm, and
> implemented by pmem device.  Then pmem device calls its ->block_lost()
> to find the filesystem which the damaged page located in, and call
> ->storage_lost() to track files or metadata assocaited with this page.
> Finally we are able to try to fix the damaged data in filesystem and do
> other necessary processing, such as killing processes who are using the
> files affected.
> 
> The call trace is like this:
>  memory_failure()
>pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
> gendisk->fops->block_lost()   => pmem_block_lost() or
>  md_blk_block_lost()
>  sb->s_ops->storage_lost()=> xfs_fs_storage_lost()
>   xfs_rmap_query_range()
>xfs_storage_lost_helper()
> mf_recover_controller->recover_fn => \ 
> memory_failure_dev_pagemap_kill_procs()
> 
> The collect_procs() and kill_procs() are moved into a callback which
> is passed from memory_failure() to xfs_storage_lost_helper().  So we
> can call it when a file assocaited is found, instead of creating a
> file list and iterate it.
> 
> The fsdax & reflink support for XFS is not contained in this patchset.

This looks promising - the overall architecture is a lot more
generic and less dependent on knowing about memory, dax or memory
failures. A few comments that I think would further improve
understanding the patchset and the implementation:

- the order of the patches is inverted. It should start with a
  single patch introducing the mf_recover_controller structure for
  callbacks, then introduce pgmap->ops->memory_failure, then
  ->block_lost, then the pmem and md implementations of
  ->block_lost, then ->storage_lost and the XFS implementations of
  ->storage_lost.

- I think the names "block_lost" and "storage_lost" are misleading.
  It's more like a "media failure" or a general "data corruption"
  event at a specific physical location. The data may not be "lost"
  but only damaged, so we might be able to recover from it without
  "losing" anything. Hence I think they could be better named,
  perhaps just "->corrupt_range"

- need to pass a {offset,len} pair through the chain, not just a
  single offset. This will allow other types of devices to report
  different ranges of failures, from a single sector to an entire
  device.

- I'm not sure that passing the mf_recover_controller structure
  through the corruption event chain is the right thing to do here.
  A block device could generate this storage failure callback if it
  detects an unrecoverable error (e.g. during a MD media scrub or
  rebuild/resilver failure) and in that case we don't have PFNs or
  memory device failure functions to perform.

  IOWs, I think the action that is taken needs to be independent of
  the source that generated the error. Even for a pmem device, we
  can be using the page cache, so it may be possible to recover the
  pmem error by writing the cached page (if it exists) back over the
  pmem.

  Hence I think that the recover function probably needs to be moved
  to the address space ops, because what we do to recover from the
  error is going to be dependent on type of mapping the filesystem
  is using. If it's a DAX mapping, we call back into a generic DAX
  function that does the vma walk and process kill functions. If it
  is a page cache mapping, then if the page is cached then we can
  try to re-write it to disk to fix the bad data, otherwise we treat
  it like a writeback error and report it on the next
  write/fsync/close operation done on that file.

  This gets rid of the mf_recover_controller altogether and allows
  the interface to be used by any sort of block device for any sort
  of bottom-up reporting of media/device failures.
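
  To make that concrete, here is a rough sketch of what such a chain
  of byte-range hooks might look like - every name below is
  illustrative only, nothing like this exists in the tree yet:

	/* added to block_device_operations - illustrative only */
	int (*corrupt_range)(struct gendisk *disk, loff_t offset, u64 len);

	/* added to super_operations - illustrative only */
	int (*corrupt_range)(struct super_block *sb, loff_t offset, u64 len);

	/* added to address_space_operations - illustrative only */
	int (*corrupt_range)(struct address_space *mapping, loff_t offset,
			     u64 len);

  i.e. the same {offset, len} tuple gets passed all the way up the
  stack, with each layer translating it into its own address space
  before calling the next layer's hook.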

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-22 Thread Dave Chinner
On Tue, Sep 22, 2020 at 12:46:05PM -0400, Mikulas Patocka wrote:
> Thanks for reviewing NVFS.

Not a review - I've just had a cursory look and not looked any
deeper after I'd noticed various red flags...

> On Tue, 22 Sep 2020, Dave Chinner wrote:
> > IOWs, extent based trees were chosen because of scalability,
> > efficiency, and flexibility reasons before the actual tree structure
> > that it would be implemented with was decided on.  b+trees were used
> > in the implementation because one tree implementation could do
> > everything as all that needed to change btree trees was the pointer
> > and record format.
> 
> I agree that the b+tree were a good choice for XFS.
> 
> In RAM-based maps, red-black trees or avl trees are used often. In 
> disk-based maps, btrees or b+trees are used. That's because in RAM, you 
> are optimizing for the number of cache lines accessed, and on the disk, 
> you are optimizing for the number of blocks accessed.

https://lore.kernel.org/linux-fsdevel/20190416122240.gn29...@dread.disaster.area/

"FWIW, I'm not convinced about the scalability of the rb/interval
tree, to tell you the truth. We got rid of the rbtree in XFS for
cache indexing because the multi-level pointer chasing was just too
expensive to do under a spinlock - it's just not a cache efficient
structure for random index object storage."

All the work I've done since has reinforced this - small node
RCU-aware btrees (4 cachelines per node) scale much, much better
than rbtrees, and they can be made lockless, too.

One of the reasons that btrees are more efficient in memory is the
behaviour of modern CPUs and their hardware prefetchers. It is
actually more time efficient to do a linear search of a small node
and then move to another small node than it is to do a binary search
of a large node in memory.  The CPU usage trade-off between linear
search overhead and chasing another pointer is currently somewhere
between 4 and 8 cachelines or pointers/records in a node on modern
x86-64 CPUs.

SO, yeah, btrees are actually very efficient for in-memory indexes
for the same reasons they are efficient for on-disk structures -
they pack more information per node than a binary structure, and
it's faster to search within a node than it is to fetch another node...
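
As a toy illustration of that trade-off (not the kernel or XFS btree
code, just the shape of the search):

	/* small in-memory btree node sized to a few cachelines */
	#define NODE_KEYS	8

	struct btnode {
		int		nr;			/* keys in use */
		unsigned long	keys[NODE_KEYS];	/* sorted */
		struct btnode	*child[NODE_KEYS + 1];
	};

	/*
	 * The whole node is pulled in by the hardware prefetcher, so a
	 * linear scan here is typically cheaper than a binary search
	 * over one huge node full of cache misses.
	 */
	static struct btnode *btnode_descend(struct btnode *node,
					     unsigned long key)
	{
		int i;

		for (i = 0; i < node->nr; i++)
			if (key < node->keys[i])
				break;
		return node->child[i];
	}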

> > The result of this is that we have made -zero- changes to the XFS
> > structure and algorithms for SSDs. We don't do different things
> > based on the blkdev rotational flag, or anything like that. XFS
> > behaves exactly the same on spinning disks as it does SSDs as it
> > does PMEM and it performs well on all of them. And that performance
> > doesn't drop away as you increase the scale and capability of the
> > underlying storage.
> > 
> > That's what happens when storage algorithms are designed for
> > concurrency and efficiency at scale rather than optimising for a
> > specific storage characteristic.
> > 
> > NVFS is optimised for a specific storage characteristic (i.e. low
> > latency synchronous storage), so I would absolutely expect it to be
> > faster than XFS on that specific storage. However, claims like this:
> > 
> > > On persistent memory, each access has its own cost, so NVFS uses metadata
> > > structures that minimize the number of cache lines accessed (rather than
> > > the number of blocks accessed). For block mapping, NVFS uses the classic
> > > unix direct/indirect blocks - if a file block is mapped by a 3rd-level
> > > indirect block, we do just three memory accesses and we are done. If we
> > > used b+trees, the number of accesses would be much larger than 3 (we
> > > would have to do binary search in the b+tree nodes).
> > 
> > ... are kinda naive, because you're clearly optimising the wrong
> > aspect of block mapping. Extents solve the block indexing overhead
> > problem; optimising the type of tree you use to index the indirect
> > blocks doesn't avoid the overhead of having to iterate every block
> > for range operations.
> > 
> > IOWs, we use extents because they are space and time efficient for
> > the general use cases. XFS can map 2^21 blocks into a single 16 byte
> > extent record (8GiB file mapping for 4k block size) and so the vast
> > majority of files in a filesystem are mapped with a single extent.
> 
> BTW. How does XFS "predict" the file size? - so that it allocates extent 
> of proper size without knowing how big the file will be?

Oh, there's probably 10-15,000 lines of code involved in getting
that right. There's delayed allocation, speculative preallocation,
extent size hints, about 10 distinct allocation policies including
"allocate exactly at this block or fail" that allow compl

Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-21 Thread Dave Chinner
fications
it knows nothing about are executed atomically?

That, too me, looks like a fundamental, unfixable flaw in this
approach...

I can see how "almost in place" modification can be done by having
two copies side by side and updating one while the other is the
active copy and switching atomically between the two objects. That
way a traditional soft-update algorithm would work because the
exposure of the changes is via ordering the active copy switches.
That would come at a cost, though, both in metadata footprint and
CPU overhead.
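
For illustration, the sort of two-copy scheme I mean looks roughly
like this (a sketch only - 'struct foo' and the flush steps are
placeholders, this is not NVFS code):

	struct ab_object {
		struct foo	copies[2];	/* both live in pmem */
		unsigned int	active;		/* which copy readers see */
	};

	static void ab_update(struct ab_object *obj, const struct foo *new)
	{
		unsigned int next = obj->active ^ 1;

		obj->copies[next] = *new;
		/* flush/fence copies[next] to media here */
		WRITE_ONCE(obj->active, next);
		/* flush/fence 'active' so the switch itself is durable */
	}

The ordering of the active copy switches is what gives you the
soft-update style consistency, but you pay for it with double the
metadata footprint and an extra copy on every update.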

So, what have I missed about the way metadata is updated in the pmem
that allows non-atomic updates to work reliably?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/9] fs: Introduce i_blocks_per_page

2020-08-25 Thread Dave Chinner
On Tue, Aug 25, 2020 at 01:49:22PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 24, 2020 at 03:55:03PM +0100, Matthew Wilcox (Oracle) wrote:
> > This helper is useful for both THPs and for supporting block size larger
> > than page size.  Convert all users that I could find (we have a few
> > different ways of writing this idiom, and I may have missed some).
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > Reviewed-by: Christoph Hellwig 
> 
> /me wonders what will happen when someone tries to make blocksz >
> pagesize work,

I abstract the page/block size stuff into "chunks". i.e. we work on
the smallest contiguous chunk of data the current combination of
page and inode define. In the context of this patch, it is simply
just:

s/i_blocks_per_page/iomap_chunks_per_page/g

i.e. The helper functions end up looking like this:

static inline unsigned
iomap_chunk_size(struct inode *inode, struct page *page)
{
	return min_t(unsigned, page_size(page), i_blocksize(inode));
}

static inline unsigned
iomap_chunk_bits(struct inode *inode, struct page *page)
{
	return min_t(unsigned, page_shift(page), inode->i_blkbits);
}

static inline unsigned
iomap_chunks_per_page(struct inode *inode, struct page *page)
{
	return page_size(page) >> inode->i_blkbits;
}

and the latter is actually the same as what i_blocks_per_page() is
currently implemented as...
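
A caller in the buffered IO path would then walk a page in chunks
along the lines of (illustrative only):

	unsigned int chunk_size = iomap_chunk_size(inode, page);
	unsigned int i, nr = iomap_chunks_per_page(inode, page);

	for (i = 0; i < nr; i++) {
		/* operate on bytes [i * chunk_size, (i + 1) * chunk_size) */
	}

so the sub-page walk no longer open-codes any PAGE_SIZE or
i_blocksize() assumptions.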

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-25 Thread Dave Chinner
On Tue, Aug 25, 2020 at 01:40:24PM +0100, Matthew Wilcox wrote:
> Any objection to leaving this patch as-is with a u64 length?

No objection here - I just wanted to make sure that signed/unsigned
overflow was not going to be an issue...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 09:35:59PM -0600, Andreas Dilger wrote:
> On Aug 24, 2020, at 9:26 PM, Matthew Wilcox  wrote:
> > 
> > On Tue, Aug 25, 2020 at 10:27:35AM +1000, Dave Chinner wrote:
> >>>   do {
> >>> - unsigned offset, bytes;
> >>> -
> >>> - offset = offset_in_page(pos);
> >>> - bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> >>> + loff_t bytes;
> >>> 
> >>>   if (IS_DAX(inode))
> >>> - status = dax_iomap_zero(pos, offset, bytes, iomap);
> >>> + bytes = dax_iomap_zero(pos, length, iomap);
> >> 
> >> Hmmm. everything is loff_t here, but the callers are defining length
> >> as u64, not loff_t. Is there a potential sign conversion problem
> >> here? (sure 64 bit is way beyond anything we'll pass here, but...)
> > 
> > I've gone back and forth on the correct type for 'length' a few times.
> > size_t is too small (not for zeroing, but for seek()).  An unsigned type
> > seems right -- a length can't be negative, and we don't want to give
> > the impression that it can.  But the return value from these functions
> > definitely needs to be signed so we can represent an error.  So a u64
> > length with an loff_t return type feels like the best solution.  And
> > the upper layers have to promise not to pass in a length that's more
> > than 2^63-1.
> 
> The problem with allowing a u64 as the length is that it leads to the
> possibility of an argument value that cannot be returned.  Checking
> length < 0 is not worse than checking length > 0x7fffffffffffffff,
> and has the benefit of consistency with the other argument types and
> signs...

I think the problem here is that we have no guaranteed 64 bit size
type. When that was the case with off_t, we created loff_t to always
represent a 64 bit offset value. However, we never created one for
the count/size that is passed alongside loff_t in many places - it
was said that "syscalls are limited to 32 bit sizes" and
"size_t is 64 bit on 64 bit platforms" and so on, and so we still
don't have a clean way to pass 64 bit sizes through the IO path.

We've been living with this shitty situation for a long time now, so
perhaps it's time for us to define lsize_t for 64 bit lengths and
start using that everywhere that needs a 64 bit clean path
through the code, regardless of whether the arch is 32 or 64 bit...
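
i.e. something along these lines (a sketch only - the name and where
it lives are up for discussion):

	/*
	 * A length/size type that is guaranteed to be 64 bits on all
	 * platforms, mirroring what loff_t does for file offsets.
	 */
	typedef u64	lsize_t;

	loff_t dax_iomap_zero(loff_t pos, lsize_t length, struct iomap *iomap);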

Thoughts?

-Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 8/9] iomap: Convert iomap_write_end types

2020-08-24 Thread Dave Chinner
On Tue, Aug 25, 2020 at 02:06:05AM +0100, Matthew Wilcox wrote:
> On Tue, Aug 25, 2020 at 10:12:23AM +1000, Dave Chinner wrote:
> > > -static int
> > > -__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> > > - unsigned copied, struct page *page)
> > > +static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t 
> > > len,
> > > + size_t copied, struct page *page)
> > >  {
> > 
> > Please leave the function declarations formatted the same way as
> > they currently are. They are done that way intentionally so it is
> > easy to grep for function definitions. Not to mention it's much
> > easier to read than when the function name is commingled into the
> > multiline parameter list like...
> 
> I understand that's true for XFS, but it's not true throughout the
> rest of the kernel. 

What other code does is irrelevant. I'm trying to maintain and
improve the consistency of the format used for the fs/iomap code.

> This file isn't even consistent:
> 
> buffered-io.c:static inline struct iomap_page *to_iomap_page(struct page 
> *page)
> buffered-io.c:static inline bool iomap_block_needs_zeroing(struct inode
> buffered-io.c:static int iomap_zero(struct inode *inode, loff_t pos, unsigned 
> offset,
> buffered-io.c:static void iomap_writepage_end_bio(struct bio *bio)
> buffered-io.c:static int __init iomap_init(void)
> 
> (i just grepped for ^static so there're other functions not covered by this)

5 functions that have that format, compared to 45 that do have the
formatting I asked you to retain. I think it's pretty clear which
way consistency lies here...

> The other fs/iomap/ files are equally inconsistent.

Inconsistency always occurs when multiple people modify the same
code. Often that's simply because reviewers haven't noticed the
inconsistency - it's certainly not intentional.

Saying "No, I'm going to make the code less consistent because it's
already slightly inconsistent" is, IMO, not a valid response to a
review request to conform to the existing code layout in that file,
especially if it improves the consistency of the code being
modified. That's really not negotiable...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:10PM +0100, Matthew Wilcox (Oracle) wrote:
> Pass the full length to iomap_zero() and dax_iomap_zero(), and have
> them return how many bytes they actually handled.  This is preparatory
> work for handling THP, although it looks like DAX could actually take
> advantage of it if there's a larger contiguous area.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/dax.c   | 13 ++---
>  fs/iomap/buffered-io.c | 33 +++--
>  include/linux/dax.h|  3 +--
>  3 files changed, 22 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 95341af1a966..f2b912cb034e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1037,18 +1037,18 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   return ret;
>  }
>  
> -int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
> -struct iomap *iomap)
> +loff_t dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>   pgoff_t pgoff;
>   long rc, id;
>   void *kaddr;
>   bool page_aligned = false;
> -
> + unsigned offset = offset_in_page(pos);
> + unsigned size = min_t(u64, PAGE_SIZE - offset, length);
>  
>   if (IS_ALIGNED(sector << SECTOR_SHIFT, PAGE_SIZE) &&
> - IS_ALIGNED(size, PAGE_SIZE))
> + (size == PAGE_SIZE))
>   page_aligned = true;
>  
>   rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, );
> @@ -1058,8 +1058,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, 
> unsigned size,
>   id = dax_read_lock();
>  
>   if (page_aligned)
> - rc = dax_zero_page_range(iomap->dax_dev, pgoff,
> -  size >> PAGE_SHIFT);
> + rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
>   else
>   rc = dax_direct_access(iomap->dax_dev, pgoff, 1, , NULL);
>   if (rc < 0) {
> @@ -1072,7 +1071,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, 
> unsigned size,
>   dax_flush(iomap->dax_dev, kaddr + offset, size);
>   }
>   dax_read_unlock(id);
> - return 0;
> + return size;
>  }
>  
>  static loff_t
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7f618ab4b11e..2dba054095e8 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -901,11 +901,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, 
> loff_t len,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> -static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> - unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_zero(struct inode *inode, loff_t pos, u64 length,
> + struct iomap *iomap, struct iomap *srcmap)

While you are touching this, please convert it back to:

static loff_t
iomap_zero(struct inode *inode, loff_t pos, u64 length,
	struct iomap *iomap, struct iomap *srcmap)


>  {
>   struct page *page;
>   int status;
> + unsigned offset = offset_in_page(pos);
> + unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
>  
>   status = iomap_write_begin(inode, pos, bytes, 0, , iomap, srcmap);
>   if (status)
> @@ -917,38 +919,33 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
> unsigned offset,
>   return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
>  }
>  
> -static loff_t
> -iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> - void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
> + loff_t length, void *data, struct iomap *iomap,
> + struct iomap *srcmap)

And leave this as it was formatted.

/me missed this when iomap_readahead() was introduced, too.


>  {
>   bool *did_zero = data;
>   loff_t written = 0;
> - int status;
>  
>   /* already zeroed?  we're done. */
>   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> - return count;
> + return length;
>  
>   do {
> - unsigned offset, bytes;
> -
> - offset = offset_in_page(pos);
> - bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> + loff_t bytes;
>  
>   if (IS_DAX(inode))
> - status = dax_iomap_zero(pos, offset, bytes, iomap);
> + bytes = dax_iomap_zero(pos, length, iomap);

Hmmm. everything is loff_t here, but the callers are defining length
as u64, not loff_t. Is there a potential sign conversion problem
here? (sure 64 bit is way beyond anything we'll pass here, but...)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 8/9] iomap: Convert iomap_write_end types

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:09PM +0100, Matthew Wilcox (Oracle) wrote:
> iomap_write_end cannot return an error, so switch it to return
> size_t instead of int and remove the error checking from the callers.
> Also convert the arguments to size_t from unsigned int, in case anyone
> ever wants to support a page size larger than 2GB.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 31 ---
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8c6187a6f34e..7f618ab4b11e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -666,9 +666,8 @@ iomap_set_page_dirty(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
>  
> -static int
> -__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> - unsigned copied, struct page *page)
> +static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> + size_t copied, struct page *page)
>  {

Please leave the function declarations formatted the same way as
they currently are. They are done that way intentionally so it is
easy to grep for function definitions. Not to mention it's much
easier to read than when the function name is commingled into the
multiline parameter list like...

> -static int
> -iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned 
> copied,
> - struct page *page, struct iomap *iomap, struct iomap *srcmap)
> +/* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
> +static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> + size_t copied, struct page *page, struct iomap *iomap,
> +     struct iomap *srcmap)

... this.

Otherwise the code looks fine.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 6/9] iomap: Convert read_count to byte count

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:07PM +0100, Matthew Wilcox (Oracle) wrote:
> Instead of counting bio segments, count the number of bytes submitted.
> This insulates us from the block layer's definition of what a 'same page'
> is, which is not necessarily clear once THPs are involved.

I'd like to see a comment on the definition of struct iomap_page to
indicate that read_count (and write count) reflect the byte count of
IO currently in flight on the page, not an IO count, because THP
makes tracking this via bio state hard. Otherwise it is not at all
obvious why it is done and why it is intentional...
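
i.e. something along the lines of this on the structure itself (the
wording is only a suggestion):

	struct iomap_page {
		/*
		 * read_count/write_count are the number of bytes of IO
		 * currently in flight over this page, not the number of
		 * IOs.  THPs make per-bio "same page" accounting
		 * unreliable, so we count bytes submitted/completed.
		 */
		atomic_t	read_count;
		atomic_t	write_count;
		spinlock_t	uptodate_lock;
		unsigned long	uptodate[];
	};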

Otherwise the code looks OK.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 5/9] iomap: Support arbitrarily many blocks per page

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:06PM +0100, Matthew Wilcox (Oracle) wrote:
> Size the uptodate array dynamically to support larger pages in the
> page cache.  With a 64kB page, we're only saving 8 bytes per page today,
> but with a 2MB maximum page size, we'd have to allocate more than 4kB
> per page.  Add a few debugging assertions.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index dbf9572dabe9..844e95cacea8 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -22,18 +22,19 @@
>  #include "../internal.h"
>  
>  /*
> - * Structure allocated for each page when block size < PAGE_SIZE to track
> + * Structure allocated for each page when block size < page size to track
>   * sub-page uptodate status and I/O completions.
>   */
>  struct iomap_page {
>   atomic_tread_count;
>   atomic_twrite_count;
>   spinlock_t  uptodate_lock;
> - DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> + unsigned long   uptodate[];
>  };
>  
>  static inline struct iomap_page *to_iomap_page(struct page *page)
>  {
> + VM_BUG_ON_PGFLAGS(PageTail(page), page);
>   if (page_has_private(page))
>   return (struct iomap_page *)page_private(page);
>   return NULL;

Just to confirm: this VM bug check is needed because we only
attach the iomap_page to the head page of a compound page?

Assuming that I've understood the above correctly:

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 4/9] iomap: Use bitmap ops to set uptodate bits

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:05PM +0100, Matthew Wilcox (Oracle) wrote:
> Now that the bitmap is protected by a spinlock, we can use the
> more efficient bitmap ops instead of individual test/set bit ops.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/iomap/buffered-io.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)

Looks good.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/9] fs: Introduce i_blocks_per_page

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:03PM +0100, Matthew Wilcox (Oracle) wrote:
> This helper is useful for both THPs and for supporting block size larger
> than page size.  Convert all users that I could find (we have a few
> different ways of writing this idiom, and I may have missed some).

There probably is - ISTR having a lot more of these changes for the
block_size > page_size patches as it needed the same {page, inode}
abstraction for calculating the range to iterate. But they can be
dealt with on a case by case basis, I think.

> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/iomap/buffered-io.c  |  8 
>  fs/jfs/jfs_metapage.c   |  2 +-
>  fs/xfs/xfs_aops.c   |  2 +-
>  include/linux/pagemap.h | 16 
>  4 files changed, 22 insertions(+), 6 deletions(-)

Otherwise looks good.

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 3/9] iomap: Use kzalloc to allocate iomap_page

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:04PM +0100, Matthew Wilcox (Oracle) wrote:
> We can skip most of the initialisation, although spinlocks still
> need explicit initialisation as architectures may use a non-zero
> value to indicate unlocked.  The comment is no longer useful as
> attach_page_private() handles the refcount now.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/iomap/buffered-io.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)

The sooner this goes in the better :)

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/9] iomap: Fix misplaced page flushing

2020-08-24 Thread Dave Chinner
On Mon, Aug 24, 2020 at 03:55:02PM +0100, Matthew Wilcox (Oracle) wrote:
> If iomap_unshare_actor() unshares to an inline iomap, the page was
> not being flushed.  block_write_end() and __iomap_write_end() already
> contain flushes, so adding it to iomap_write_end_inline() seems like
> the best place.  That means we can remove it from iomap_write_actor().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

looks good.

Reviewed-by: Dave Chinner 

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

2020-08-10 Thread Dave Chinner
On Mon, Aug 10, 2020 at 12:16:57PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 10, 2020 at 04:15:50PM +0800, Ruan Shiyang wrote:
> > 
> > 
> > On 2020/8/7 下午9:38, Matthew Wilcox wrote:
> > > On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> > > > This patchset is a try to resolve the problem of tracking shared page
> > > > for fsdax.
> > > > 
> > > > Instead of per-page tracking method, this patchset introduces a query
> > > > interface: get_shared_files(), which is implemented by each FS, to
> > > > obtain the owners of a shared page.  It returns an owner list of this
> > > > shared page.  Then, the memory-failure() iterates the list to be able
> > > > to notify each process using files that sharing this page.
> > > > 
> > > > The design of the tracking method is as follow:
> > > > 1. dax_assocaite_entry() associates the owner's info to this page
> > > 
> > > I think that's the first problem with this design.  dax_associate_entry is
> > > a horrendous idea which needs to be ripped out, not made more important.
> > > It's all part of the general problem of trying to do something on a
> > > per-page basis instead of per-extent basis.
> > > 
> > 
> > The memory-failure needs to track owners info from a dax page, so I should
> > associate the owner with this page.  In this version, I associate the block
> > device to the dax page, so that the memory-failure is able to iterate the
> > owners by the query interface provided by filesystem.
> 
> No, it doesn't need to track owner info from a DAX page.  What it needs to
> do is ask the filesystem.

Just to add to this: the owner tracking that is currently done deep
inside the DAX code needs to be moved out to the owner of the dax
device. That may be the dax device itself, or it may be a filesystem
like ext4 or XFS. Initially, nothing will be able to share pages and
page owner tracking should be done on the page itself as the DAX
code currently does.

Hence when a page error occurs, the device owner is called with the
page and error information, and the dax device or filesystem can
then look up the page owner (via mapping/index pointers) and run the
Die Userspace Die functions instead of running this all internally
in the DAX code.

Once the implementation is abstracted and controlled by the device
owner, then we can start working to change the XFS implementation to
support shared pages...
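
To be concrete about the shape of that abstraction (every name below
is illustrative, nothing like this exists yet): the dax device would
carry an ops table registered by whoever owns it, e.g.

	struct dax_holder_operations {
		/* called when a media error lands in [offset, offset + len) */
		int (*notify_failure)(struct dax_device *dax_dev,
				      u64 offset, u64 len);
	};

and the generic memory failure path would call ->notify_failure()
rather than walking mapping/index state inside the DAX code itself.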

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 1/1] pmem: Add cond_resched() in bio_for_each_segment loop in pmem_make_request

2020-08-03 Thread Dave Chinner
On Mon, Aug 03, 2020 at 12:44:04PM +0530, Ritesh Harjani wrote:
> 
> 
> On 8/3/20 4:31 AM, Dave Chinner wrote:
> > On Wed, Jul 29, 2020 at 02:15:18PM +0530, Ritesh Harjani wrote:
> > > For systems which do not have CONFIG_PREEMPT set and
> > > if there is a heavy multi-threaded load/store operation happening
> > > on pmem + sometimes along with device latencies, softlockup warnings like
> > > this could trigger. This was seen on Power where pagesize is 64K.
> > > 
> > > To avoid softlockup, this patch adds a cond_resched() in this path.
> > > 
> > > <...>
> > > watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
> > > <...>
> > > CPU: 31 PID: 15627 <..> 5.3.18-20
> > > <...>
> > > NIP memcpy_power7+0x43c/0x7e0
> > > LR memcpy_flushcache+0x28/0xa0
> > > 
> > > Call Trace:
> > > memcpy_power7+0x274/0x7e0 (unreliable)
> > > memcpy_flushcache+0x28/0xa0
> > > write_pmem+0xa0/0x100 [nd_pmem]
> > > pmem_do_bvec+0x1f0/0x420 [nd_pmem]
> > > pmem_make_request+0x14c/0x370 [nd_pmem]
> > > generic_make_request+0x164/0x400
> > > submit_bio+0x134/0x2e0
> > > submit_bio_wait+0x70/0xc0
> > > blkdev_issue_zeroout+0xf4/0x2a0
> > > xfs_zero_extent+0x90/0xc0 [xfs]
> > > xfs_bmapi_convert_unwritten+0x198/0x230 [xfs]
> > > xfs_bmapi_write+0x284/0x630 [xfs]
> > > xfs_iomap_write_direct+0x1f0/0x3e0 [xfs]
> > > xfs_file_iomap_begin+0x344/0x690 [xfs]
> > > dax_iomap_pmd_fault+0x488/0xc10
> > > __xfs_filemap_fault+0x26c/0x2b0 [xfs]
> > > __handle_mm_fault+0x794/0x1af0
> > > handle_mm_fault+0x12c/0x220
> > > __do_page_fault+0x290/0xe40
> > > do_page_fault+0x38/0xc0
> > > handle_page_fault+0x10/0x30
> > > 
> > > Reviewed-by: Aneesh Kumar K.V 
> > > Signed-off-by: Ritesh Harjani 
> > > ---
> > >   drivers/nvdimm/pmem.c | 1 +
> > >   1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > index 2df6994acf83..fcf7af13897e 100644
> > > --- a/drivers/nvdimm/pmem.c
> > > +++ b/drivers/nvdimm/pmem.c
> > > @@ -214,6 +214,7 @@ static blk_qc_t pmem_make_request(struct 
> > > request_queue *q, struct bio *bio)
> > >   bio->bi_status = rc;
> > >   break;
> > >   }
> > > + cond_resched();
> > 
> > There are already cond_resched() calls between submitted bios in
> > blkdev_issue_zeroout() via both __blkdev_issue_zero_pages() and
> > __blkdev_issue_write_zeroes(), so I'm kinda wondering where the
> > problem is coming from here.
> 
> This problem is coming from that bio call- submit_bio()
> 
> > 
> > Just how big is the bio being issued here that it spins for 22s
> > trying to copy it?
> 
> It's 256 (due to BIO_MAX_PAGES) * 64KB (pagesize) = 16MB.
> So this is definitely not an easy trigger; as per the tester, it was mainly
> seen on a VM.

Right, so a memcpy() of 16MB of data in a single bio is taking >22s?

If so, then you can't solve this problem with cond_resched() - if
something that should only take half a millisecond to run is taking
22s of CPU time, there are bigger problems occurring that need
fixing.

i.e. if someone is running a workload that has effectively
livelocked the hardware cache coherency protocol across the entire
machine, then changing kernel code that requires memory bandwidth to
be available is not going to fix the problem. All it does is shoot
the messenger that tells you there is something going wrong.

> Looking at the cond_resched() inside dax_writeback_mapping_range()
> in xas_for_each_marked() loop, I thought it should be good to have a
> cond_resched() in the above path as well.

That's not done on every page/bio - it's done every 4096 pages, or
every 256MB of memory written back on 64k page machines. IOWs, using
cond_resched() here is a sensible thing to do if you have a locked
loop that might run for seconds.
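
i.e. the general pattern for a long locked loop is something like
this (a generic sketch, not a quote of the dax code):

	unsigned int scanned = 0;

	xas_for_each_marked(&xas, entry, end_index, PAGECACHE_TAG_TOWRITE) {
		/* per-entry work ... */

		if (++scanned % 4096)
			continue;
		xas_pause(&xas);
		xas_unlock_irq(&xas);
		cond_resched();
		xas_lock_irq(&xas);
	}

so the reschedule point only triggers after a substantial amount of
work has been done, not on every iteration.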

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 1/1] pmem: Add cond_resched() in bio_for_each_segment loop in pmem_make_request

2020-08-02 Thread Dave Chinner
On Wed, Jul 29, 2020 at 02:15:18PM +0530, Ritesh Harjani wrote:
> For systems which do not have CONFIG_PREEMPT set and
> if there is a heavy multi-threaded load/store operation happening
> on pmem + sometimes along with device latencies, softlockup warnings like
> this could trigger. This was seen on Power where pagesize is 64K.
> 
> To avoid softlockup, this patch adds a cond_resched() in this path.
> 
> <...>
> watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
> <...>
> CPU: 31 PID: 15627 <..> 5.3.18-20
> <...>
> NIP memcpy_power7+0x43c/0x7e0
> LR memcpy_flushcache+0x28/0xa0
> 
> Call Trace:
> memcpy_power7+0x274/0x7e0 (unreliable)
> memcpy_flushcache+0x28/0xa0
> write_pmem+0xa0/0x100 [nd_pmem]
> pmem_do_bvec+0x1f0/0x420 [nd_pmem]
> pmem_make_request+0x14c/0x370 [nd_pmem]
> generic_make_request+0x164/0x400
> submit_bio+0x134/0x2e0
> submit_bio_wait+0x70/0xc0
> blkdev_issue_zeroout+0xf4/0x2a0
> xfs_zero_extent+0x90/0xc0 [xfs]
> xfs_bmapi_convert_unwritten+0x198/0x230 [xfs]
> xfs_bmapi_write+0x284/0x630 [xfs]
> xfs_iomap_write_direct+0x1f0/0x3e0 [xfs]
> xfs_file_iomap_begin+0x344/0x690 [xfs]
> dax_iomap_pmd_fault+0x488/0xc10
> __xfs_filemap_fault+0x26c/0x2b0 [xfs]
> __handle_mm_fault+0x794/0x1af0
> handle_mm_fault+0x12c/0x220
> __do_page_fault+0x290/0xe40
> do_page_fault+0x38/0xc0
> handle_page_fault+0x10/0x30
> 
> Reviewed-by: Aneesh Kumar K.V 
> Signed-off-by: Ritesh Harjani 
> ---
>  drivers/nvdimm/pmem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 2df6994acf83..fcf7af13897e 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -214,6 +214,7 @@ static blk_qc_t pmem_make_request(struct request_queue 
> *q, struct bio *bio)
>   bio->bi_status = rc;
>   break;
>   }
> + cond_resched();

There are already cond_resched() calls between submitted bios in
blkdev_issue_zeroout() via both __blkdev_issue_zero_pages() and
__blkdev_issue_write_zeroes(), so I'm kinda wondering where the
problem is coming from here.

Just how big is the bio being issued here that it spins for 22s
trying to copy it?

And, really, if the system is that bound on cacheline bouncing that
it prevents memcpy() from making progress, I think we probably
should be issuing a soft lockup warning like this...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Can we change the S_DAX flag immediately on XFS without dropping caches?

2020-07-29 Thread Dave Chinner
On Wed, Jul 29, 2020 at 11:23:21AM +0900, Yasunori Goto wrote:
> Hi,
> 
> On 2020/07/28 11:20, Dave Chinner wrote:
> > On Tue, Jul 28, 2020 at 02:00:08AM +, Li, Hao wrote:
> > > Hi,
> > > 
> > > I have noticed that we have to drop caches to make the changing of S_DAX
> > > flag take effect after using chattr +x to turn on DAX for a existing
> > > regular file. The related function is xfs_diflags_to_iflags, whose
> > > second parameter determines whether we should set S_DAX immediately.
> > Yup, as documented in Documentation/filesystems/dax.txt. Specifically:
> > 
> >   6. When changing the S_DAX policy via toggling the persistent 
> > FS_XFLAG_DAX flag,
> >  the change in behaviour for existing regular files may not occur
> >  immediately.  If the change must take effect immediately, the 
> > administrator
> >  needs to:
> > 
> >  a) stop the application so there are no active references to the data 
> > set
> > the policy change will affect
> > 
> >  b) evict the data set from kernel caches so it will be re-instantiated 
> > when
> > the application is restarted. This can be achieved by:
> > 
> > i. drop-caches
> > ii. a filesystem unmount and mount cycle
> > iii. a system reboot
> > 
> > > I can't figure out why we do this. Is this because the page caches in
> > > address_space->i_pages are hard to deal with?
> > Because of unfixable races in the page fault path that prevent
> > changing the caching behaviour of the inode while concurrent access
> > is possible. The only way to guarantee races can't happen is to
> > cycle the inode out of cache.
> 
> I understand why the drop_cache operation is necessary. Thanks.
> 
> BTW, even if a normal user becomes able to change the DAX flag for an inode,
> the drop_cache operation still requires root permission, right?

Step back for a minute and explain why you want to be able to change
the DAX mode of a file -as a user-.

> So, if the kernel had a feature that let a normal user drop the cache for "an
> inode" they have permission on, I think it would improve the above limitation,
> and we would like to try to implement it soon.

No, drop_caches is not going to be made available to users. That
makes it a trivial system-wide DoS vector.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Can we change the S_DAX flag immediately on XFS without dropping caches?

2020-07-27 Thread Dave Chinner
On Tue, Jul 28, 2020 at 02:00:08AM +, Li, Hao wrote:
> Hi,
> 
> I have noticed that we have to drop caches to make the changing of S_DAX
> flag take effect after using chattr +x to turn on DAX for a existing
> regular file. The related function is xfs_diflags_to_iflags, whose
> second parameter determines whether we should set S_DAX immediately.

Yup, as documented in Documentation/filesystems/dax.txt. Specifically:

 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX 
flag,
the change in behaviour for existing regular files may not occur
immediately.  If the change must take effect immediately, the administrator
needs to:

a) stop the application so there are no active references to the data set
   the policy change will affect

b) evict the data set from kernel caches so it will be re-instantiated when
   the application is restarted. This can be achieved by:

   i. drop-caches
   ii. a filesystem unmount and mount cycle
   iii. a system reboot

> I can't figure out why we do this. Is this because the page caches in
> address_space->i_pages are hard to deal with?

Because of unfixable races in the page fault path that prevent
changing the caching behaviour of the inode while concurrent access
is possible. The only way to guarantee races can't happen is to
cycle the inode out of cache.

> I also wonder what will
> happen if we set S_DAX unconditionally. Thanks!

As per the documentation, that's exactly what the dax=always mount
option does.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-06-04 Thread Dave Chinner
On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
> > 
> > 
> > On 2020/4/28 下午2:43, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> > > > 
> > > > 在 2020/4/27 20:28:36, "Matthew Wilcox"  写道:
> > > > 
> > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > >   This patchset is a try to resolve the shared 'page cache' problem 
> > > > > > for
> > > > > >   fsdax.
> > > > > > 
> > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax 
> > > > > > entry
> > > > > >   will be associated more than once if is shared.  At the second 
> > > > > > time we
> > > > > >   associate this entry, we create this rb-tree and store its root in
> > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) 
> > > > > > when
> > > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > 
> > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > a huge overhead.
> > > > > 
> > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > yet...
> > > > 
> > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > to obtain this info from FS when associating a page.  May be a bit
> > > > complicated.  Let me think about it...
> > > 
> > > That's why I want the -user of this association- to do a filesystem
> > > callout instead of keeping it's own naive tracking infrastructure.
> > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > from it's own extent tracking infrastructure, and there's zero
> > > runtime overhead when there are no errors present.
> > 
> > Hi Dave,
> > 
> > I ran into some difficulties when trying to implement the per-extent rmap
> > tracking.  So, I re-read your comments and found that I was misunderstanding
> > what you described here.
> > 
> > I think what you mean is: we don't need the in-memory dax-rmap tracking now.
> > Just ask the FS for the owner's information that associate with one page
> > when memory-failure.  So, the per-page (even per-extent) dax-rmap is
> > needless in this case.  Is this right?
> 
> Right.  XFS already has its own rmap tree.

*nod*

> > Based on this, we only need to store the extent information of a fsdax page
> > in its ->mapping (by searching from FS).  Then obtain the owners of this
> > page (also by searching from FS) when memory-failure or other rmap case
> > occurs.
> 
> I don't even think you need that much.  All you need is the "physical"
> offset of that page within the pmem device (e.g. 'this is the 307th 4k
> page == offset 1257472 since the start of /dev/pmem0') and xfs can look
> up the owner of that range of physical storage and deal with it as
> needed.

Right. If we have the dax device associated with the page that had
the failure, then we can determine the offset of the page into the
block device address space and that's all we need to find the owner
of the page in the filesystem.

Note that there may actually be no owner - the page that had the
fault might land in free space, in which case we can simply zero
the page and clear the error.

> > So, a fsdax page is no longer associated with a specific file, but with a
> > FS(or the pmem device).  I think it's easier to understand and implement.

Effectively, yes. But we shouldn't need to actually associate the
page with anything at the filesystem level because it is already
associated with a DAX device at a lower level via a dev_pagemap.
The hardware page fault already runs through this code,
memory_failure_dev_pagemap(), before it gets to the DAX code, so
really all we need to do is have that function pass us the page, the
offset into the device and, say, the struct dax_device associated with
that page so we can get to the filesystem superblock we can then use
for rmap lookups on...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-04-28 Thread Dave Chinner
On Tue, Apr 28, 2020 at 08:37:32AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > > > On 2020/4/28 下午2:43, Dave Chinner wrote:
> > > > > On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox"  写道:
> > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > > >   This patchset is a try to resolve the shared 'page cache' 
> > > > > > > > problem for
> > > > > > > >   fsdax.
> > > > > > > > 
> > > > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A 
> > > > > > > > dax entry
> > > > > > > >   will be associated more than once if is shared.  At the 
> > > > > > > > second time we
> > > > > > > >   associate this entry, we create this rb-tree and store its 
> > > > > > > > root in
> > > > > > > >   page->private(not used in fsdax).  Insert (->mapping, 
> > > > > > > > ->index) when
> > > > > > > >   dax_associate_entry() and delete it when 
> > > > > > > > dax_disassociate_entry().
> > > > > > > 
> > > > > > > Do we really want to track all of this on a per-page basis?  I 
> > > > > > > would
> > > > > > > have thought a per-extent basis was more useful.  Essentially, 
> > > > > > > create
> > > > > > > a new address_space for each shared extent.  Per page just seems 
> > > > > > > like
> > > > > > > a huge overhead.
> > > > > > > 
> > > > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > > > yet...
> > > > > > 
> > > > > > But the extent info is maintained by filesystem.  I think we need a 
> > > > > > way
> > > > > > to obtain this info from FS when associating a page.  May be a bit
> > > > > > complicated.  Let me think about it...
> > > > > 
> > > > > That's why I want the -user of this association- to do a filesystem
> > > > > callout instead of keeping it's own naive tracking infrastructure.
> > > > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > > > from it's own extent tracking infrastructure, and there's zero
> > > > > runtime overhead when there are no errors present.
> > > > > 
> > > > > At the moment, this "dax association" is used to "report" a storage
> > > > > media error directly to userspace. I say "report" because what it
> > > > > does is kill userspace processes dead. The storage media error
> > > > > actually needs to be reported to the owner of the storage media,
> > > > > which in the case of FS-DAX is the filesytem.
> > > > 
> > > > Understood.
> > > > 
> > > > BTW, this is the usage in memory-failure, so what about rmap?  I have 
> > > > not
> > > > found how to use this tracking in rmap.  Do you have any ideas?
> > > > 
> > > > > 
> > > > > That way the filesystem can then look up all the owners of that bad
> > > > > media range (i.e. the filesystem block it corresponds to) and take
> > > > > appropriate action. e.g.
> > > > 
> > > > I tried writing a function to look up all the owners' info of one block 
> > > > in
> > > > xfs for memory-failure use.  It was dropped in this patchset because I 
> > > > found
> > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  
> > > > But
> > > > by default, rmapbt is disabled.  I am not sure if it matters...
> > > 
> > > I'm pretty sure you can't have shared extents on an XFS filesystem if you
> > > _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.
> > 
> > You're confusing reflink with rmap. :)
> > 
> &

Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-04-28 Thread Dave Chinner
On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > On 2020/4/28 下午2:43, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> > > > 在 2020/4/27 20:28:36, "Matthew Wilcox"  写道:
> > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > >   This patchset is a try to resolve the shared 'page cache' problem 
> > > > > > for
> > > > > >   fsdax.
> > > > > > 
> > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax 
> > > > > > entry
> > > > > >   will be associated more than once if is shared.  At the second 
> > > > > > time we
> > > > > >   associate this entry, we create this rb-tree and store its root in
> > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) 
> > > > > > when
> > > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > 
> > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > a huge overhead.
> > > > > 
> > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > yet...
> > > > 
> > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > to obtain this info from FS when associating a page.  May be a bit
> > > > complicated.  Let me think about it...
> > > 
> > > That's why I want the -user of this association- to do a filesystem
> > > callout instead of keeping it's own naive tracking infrastructure.
> > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > from it's own extent tracking infrastructure, and there's zero
> > > runtime overhead when there are no errors present.
> > > 
> > > At the moment, this "dax association" is used to "report" a storage
> > > media error directly to userspace. I say "report" because what it
> > > does is kill userspace processes dead. The storage media error
> > > actually needs to be reported to the owner of the storage media,
> > > which in the case of FS-DAX is the filesytem.
> > 
> > Understood.
> > 
> > BTW, this is the usage in memory-failure, so what about rmap?  I have not
> > found how to use this tracking in rmap.  Do you have any ideas?
> > 
> > > 
> > > That way the filesystem can then look up all the owners of that bad
> > > media range (i.e. the filesystem block it corresponds to) and take
> > > appropriate action. e.g.
> > 
> > I tried writing a function to look up all the owners' info of one block in
> > xfs for memory-failure use.  It was dropped in this patchset because I found
> > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> > by default, rmapbt is disabled.  I am not sure if it matters...
> 
> I'm pretty sure you can't have shared extents on an XFS filesystem if you
> _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.

You're confusing reflink with rmap. :)

rmapbt does all the reverse mapping tracking, reflink just does the
shared data extent tracking.

But given that anyone who wants to use DAX with reflink is going to
have to mkfs their filesystem anyway (to turn on reflink) requiring
that rmapbt is also turned on is not a big deal. Especially as we
can check it at mount time in the kernel...
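
e.g. such a filesystem would be made with something like
"mkfs.xfs -m reflink=1,rmapbt=1 <dev>", and the mount code can then
refuse (or warn about) a DAX+reflink combination when the rmapbt
feature bit is not present on the filesystem.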

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-04-28 Thread Dave Chinner
On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> 
> 在 2020/4/27 20:28:36, "Matthew Wilcox"  写道:
> 
> >On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> >>  This patchset is a try to resolve the shared 'page cache' problem for
> >>  fsdax.
> >>
> >>  In order to track multiple mappings and indexes on one page, I
> >>  introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> >>  will be associated more than once if is shared.  At the second time we
> >>  associate this entry, we create this rb-tree and store its root in
> >>  page->private(not used in fsdax).  Insert (->mapping, ->index) when
> >>  dax_associate_entry() and delete it when dax_disassociate_entry().
> >
> >Do we really want to track all of this on a per-page basis?  I would
> >have thought a per-extent basis was more useful.  Essentially, create
> >a new address_space for each shared extent.  Per page just seems like
> >a huge overhead.
> >
> Per-extent tracking is a nice idea for me.  I haven't thought of it 
> yet...
> 
> But the extent info is maintained by filesystem.  I think we need a way 
> to obtain this info from FS when associating a page.  May be a bit 
> complicated.  Let me think about it...

That's why I want the -user of this association- to do a filesystem
callout instead of keeping its own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from its own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.

At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesytem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.

- if it falls in filesytem metadata, shutdown the filesystem
- if it falls in user data, call the "kill userspace dead" routines
  for each mapping/index tuple the filesystem finds for the given
  LBA address that the media error occurred.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.
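
In pseudo-code terms, the filesystem side of that callout boils down
to something like this (illustrative only - none of these helpers
exist as named):

	/* illustrative sketch, not real XFS code */
	static int xfs_corrupt_range(struct super_block *sb, u64 daddr, u64 len)
	{
		/* reverse map the bad LBA range back to its owners */
		if (range_maps_to_metadata(sb, daddr, len))
			return force_shutdown_for_corruption(sb);

		/* user data: kill every {mapping, index} tuple that maps it */
		return kill_procs_for_owners(sb, daddr, len);
	}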

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-27 Thread Dave Chinner
On Thu, Feb 27, 2020 at 10:25:17AM -0500, Vivek Goyal wrote:
> On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote:
> > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> > > [..]
> > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > > > callback that deals with page aligned entries. That change at least
> > > > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > > > zeroing path.
> > > > >
> > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take 
> > > > > page
> > > > > aligned start and size and call this interface from
> > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > > > path?
> > > > >
> > > > > Something like.
> > > > >
> > > > > __dax_zero_page_range() {
> > > > >   if(page_aligned_io)
> > > > > call_dax_page_zero_range()
> > > > >   else
> > > > > use_direct_access_and_memcpy;
> > > > > }
> > > > >
> > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > > > to calling dax_zero_page_range() instead.
> > > > >
> > > > > If yes, I am not seeing what advantage do we get by this change.
> > > > >
> > > > > - __dax_zero_page_range() seems to be called by only partial block
> > > > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > > > >
> > > > >
> > > > > - dax_zero_page_range() will be exact replacement of
> > > > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just 
> > > > > that
> > > > >   it will create a dax specific hook.
> > > > >
> > > > > In that case it might be simpler to just get rid of 
> > > > > blkdev_issue_zeroout()
> > > > > call from __dax_zero_page_range() and make sure there are no callers 
> > > > > of
> > > > > full block zeroing from this path.
> > > > 
> > > > I think you're right. The path I'm concerned about not regressing is
> > > > the error clearing on new block allocation and we get that already via
> > > > xfs_zero_extent() and sb_issue_zeroout().
> > > 
> > > Well I was wrong. I found atleast one user which uses 
> > > __dax_zero_page_range()
> > > to zero full PAGE_SIZE blocks.
> > > 
> > > xfs_io -c "allocsp 32K 0" foo.txt
> > 
> > That ioctl interface is deprecated and likely unused by any new
> > application written since 1999. It predates unwritten extents (1998)
> > and I don't think any native linux applications have ever used it. A
> > newly written DAX aware application would almost certainly not use
> > this interface.
> > 
> > IOWs, I wouldn't use it as justification for some special case
> > behaviour; I'm more likely to say "rip that ancient ioctl out" than
> > to jump through hoops because it's implementation behaviour.
> 
> Hi Dave,
> 
> Do you see any other path in xfs using iomap_zero_range() and zeroing
> full block.

Yes:

- xfs_file_aio_write_checks() for zeroing blocks between the
  existing EOF and the start of the incoming write beyond EOF
- xfs_setattr_size() on truncate up for zeroing blocks between the
  existing EOF and the new EOF.
- xfs_reflink_zero_posteof() for zeroing blocks between the old EOF
  and where the new reflinked extents are going to land beyond EOF

And don't think that blocks beyond EOF can't exist when DAX is
enabled. We can turn DAX on and off, we can crash between allocation
and file size extension, etc. Hence this code must be able to handle
zeroing large ranges of blocks beyond EOF...

> iomap_zero_range() already skips IOMAP_HOLE and
> IOMAP_UNWRITTEN. So it has to be a full block zeroing which is of not type
> IOMAP_HOLE and IOMAP_UNWRITTEN.
> 
> My understanding is that ext4 and xfs both are initializing full blocks
> using blkdev_issue_zeroout(). Only partial blocks are being zeroed using
> this dax zeroing path.

Look at the API, not the callers: iomap_zero_range takes a 64 bit
length parameter. It can be asked to zero blocks across petabytes of
a file. If there's a single block somewhere in that range, it will
only zero that block. If the entire range is allocated, it will zero
that entire range (yes, it will take forever!) as that is what it
is intended to do.

It should be pretty clear that it needs to be able to zero entire
pages, regardless of how it is currently used/called by filesystems.
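
For reference, the prototype is roughly this (it may differ slightly
between kernel versions):

    int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
                         bool *did_zero, const struct iomap_ops *ops);

    /*
     * e.g. zeroing from the old EOF to a new, larger EOF on a truncate
     * up. 'ops' is the filesystem's own iomap_ops implementation, and
     * whether any given block in the range actually gets zeroed depends
     * entirely on what the filesystem maps it to.
     */
    error = iomap_zero_range(inode, oldsize, newsize - oldsize,
                             &did_zero, ops);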

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-27 Thread Dave Chinner
On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner  wrote:
> > On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > > Anyway, partial page truncate can't ensure that data in rest of the page 
> > > can
> > > be read back successfully. Memory can get poison after the write and
> > > hence read after truncate will still fail.
> >
> > Which is where the notification requirement comes in. Yes, we may
> > still get errors on read or write, but if memory at rest goes bad,
> > we want to handle that and correct it ASAP, not wait days or months
> > for something to trip over the poisoned page before we find out
> > about it.
> >
> > > Hence, all we are trying to ensure is that if a poison is known at the
> > > time of writing partial page, then we should return error to user space.
> >
> > I think within FS-DAX infrastructure, any access to the data (read
> > or write) within a poisoned page or a page marked with PageError()
> > should return EIO to the caller, unless it's the specific command to
> > clear the error/poison state on the page. What happens with that
> > error state is then up to the caller.
> >
> 
> I agree with most of the above if you replace "device-dax error
> handling" with "System RAM error handling". It's typical memory error
> handling that injects the page->index and page->mappping dependency.

I disagree, but that's beside the point and not worth arguing.

> So you want the FS to have error handling for just pmem errors or all
> memory errors?

Just pmem errors in the specific range the filesystem manages - we
really only care about storage errors because those are the only
errors the filesystem is responsible for handling correctly.

Somebody else can worry about errors that hit page cache pages -
page cache pages require mapping/index pointers on each page anyway,
so a generic mechanism for handling those errors can be built into
the page cache. And if the error is in general kernel memory, then
it's game over for the entire kernel at that point, not just the
filesystem.

> And you want this to be done without the mm core using
> page->index to identify what to unmap when the error happens?

Isn't that exactly what I just said? We get the page address that
failed, the daxdev can turn that into a sector address and call into
the filesystem with a {sector, len, errno} tuple. We then do a
reverse mapping lookup on {sector, len} to find all the owners of
that range in the filesystem. If it's file data, that owner record
gives us the inode and the offset into the file, which then gives us
a {mapping, index} tuple.

Further, the filesystem reverse map is aware that its blocks can be
shared by multiple owners, so it will have a record for every inode
and file offset that maps to that page. Hence we can simply iterate
the reverse map and do that whacky collect_procs/kill_procs dance
for every {mapping, index} pair that references the bad range.

Nothing ever needs to be stored on the struct page...

> Memory
> error scanning is not universally available on all pmem
> implementations, so FS will need to subscribe for memory-error
> handling events.

No. Filesystems interact with the underlying device abstraction, not
the physical storage that lies behind that device abstraction.  The
filesystem should not interface with pmem directly in any way (all
direct accesses are hidden inside fs/dax.c!), nor should it care
about the quirks of the pmem implementation it is sitting on. That's
what the daxdev abstraction is supposed to hide from the users of
the pmem.

IOWs, the daxdev infrastructure subscribes to the memory-error event
subsystem and calls out to the filesystem when an error in a page in
the daxdev is reported. The filesystem only needs to know the
{sector, len, errno} tuple related to the error; it is the device's
responsibility to handle the physical mechanics of listening,
filtering and translating MCEs to a format the filesystem
understands.

Another reason it should be provided by the daxdev as a {sector,
len, errno} tuple is that it also allows non-dax block devices to
implement similar error notifications and provide filesystems
with exactly the same information, so the filesystem can start
auto-recovery processes.
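
i.e. something along these lines - every name below is invented, it
is only meant to show the translation the device does before calling
up into the filesystem:

    /* Hypothetical daxdev-side plumbing; none of this exists today. */
    struct example_dax_holder_ops {
        /* the filesystem's handler for a bad range on this device */
        int (*notify_failure)(struct super_block *sb, sector_t sector,
                              sector_t bcount, int error);
    };

    struct example_daxdev {
        phys_addr_t phys_start;                 /* device base address */
        struct super_block *holder_sb;          /* filesystem on top */
        const struct example_dax_holder_ops *holder_ops;
    };

    /* Called when the memory-failure code reports a bad pfn. */
    static int example_daxdev_notify(struct example_daxdev *dd,
                                     unsigned long pfn, int error)
    {
        sector_t sector = (PFN_PHYS(pfn) - dd->phys_start) >> SECTOR_SHIFT;

        return dd->holder_ops->notify_failure(dd->holder_sb, sector,
                                              PAGE_SIZE >> SECTOR_SHIFT,
                                              error);
    }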

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Dave Chinner
On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> [..]
> > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > callback that deals with page aligned entries. That change at least
> > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > zeroing path.
> > >
> > > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > > aligned start and size and call this interface from
> > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > path?
> > >
> > > Something like.
> > >
> > > __dax_zero_page_range() {
> > >   if(page_aligned_io)
> > > call_dax_page_zero_range()
> > >   else
> > > use_direct_access_and_memcpy;
> > > }
> > >
> > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > to calling dax_zero_page_range() instead.
> > >
> > > If yes, I am not seeing what advantage do we get by this change.
> > >
> > > - __dax_zero_page_range() seems to be called by only partial block
> > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > >
> > >
> > > - dax_zero_page_range() will be exact replacement of
> > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> > >   it will create a dax specific hook.
> > >
> > > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > > call from __dax_zero_page_range() and make sure there are no callers of
> > > full block zeroing from this path.
> > 
> > I think you're right. The path I'm concerned about not regressing is
> > the error clearing on new block allocation and we get that already via
> > xfs_zero_extent() and sb_issue_zeroout().
> 
> Well I was wrong. I found atleast one user which uses __dax_zero_page_range()
> to zero full PAGE_SIZE blocks.
> 
> xfs_io -c "allocsp 32K 0" foo.txt

That ioctl interface is deprecated and likely unused by any new
application written since 1999. It predates unwritten extents (1998)
and I don't think any native linux applications have ever used it. A
newly written DAX aware application would almost certainly not use
this interface.

IOWs, I wouldn't use it as justification for some special case
behaviour; I'm more likely to say "rip that ancient ioctl out" than
to jump through hoops because it's implementation behaviour.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-26 Thread Dave Chinner
On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> 
> [..]
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > > 
> > > > Sure, but who will call it with misaligned ranges?
> > > 
> > > create a file foo.txt of size 4K and then truncate it.
> > > 
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> > 
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> > 
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> > 
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> > 
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
> 
> Hi Dave,
> 
> What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to
> 511) is poisoned and rest don't have poison. Should this fail with -EIO.

Yes - the filesystem block still contains bad data.

> In current implementation it does not.

I'm not surprised - the whole hardware error handling architecture
for FS-DAX is fundamentally broken. It was designed for device-dax,
and it just doesn't work for FS-DAX.

For example, to get the hardware error handling to be able to kill
userspace applications, a 1:1 physical-to-logical association
constraint was added to fs/dax.c:

/*
 * TODO: for reflink+dax we need a way to associate a single page with
 * multiple address_space instances at different linear_page_index()
 * offsets.
 */
static void dax_associate_entry(void *entry, struct address_space *mapping,
                struct vm_area_struct *vma, unsigned long address)

because device-dax only has *linear mappings* and so has a fixed
reverse mapping architecture.

i.e. the whole hardware error handling infrastructure was designed
around the constraints of device-dax. device-dax does not have any
structure to serialise access to the physical storage, so locking
was added to the mapping tree. The mapping tree locking is accessed
on hardware error via the reverse mapping association in the struct
page and that's how device-dax serialises direct physical storage
access against hardware error processing.  And while the page index
is locked in the mapping tree, it can walk the process vmas that
have the page mapped to kill them so that they don't try to access
the bad page.

That bad physical storage state is carried in a volatile struct page
flag, hence requiring some mechanism to make it persistent (the
device bad blocks register) and some other mechanism to clear the
poison state (direct IO, IIRC).

It's also a big, nasty, solid roadblock to implementing shared
data extents in FS-DAX. We basically have to completely re-architect
the hardware error handling for FS-DAX so that such errors are
reported to the filesystem first and then the filesystem does what
is needed to handle the error.

None of this works for filesystems because they need to perform
different operations depending on what the page that went bad
contains. FS-DAX should never trip over an unexpected poisoned page;
we do so now because such physical storage errors are completely
hidden from the filesystem.

What you are trying to do is slap a band-aid over what to do when we
hit an unexpected page containing bad data. Filesystems expect to
find out about bad data in storage when they marshall the data into
or out of memory. They make the assumption that once it is in memory
it remains valid on the physical storage. Hence if an in-memory
error occurs, we can just toss it away and re-read it from storage,
and all is good.

FS-DAX changes that - we are no longer marshalling data into and out
of memory so we don't have a mechanism to get EIO when reading the
page into the page cache or writing it back to disk. We also don't
have an in-memory copy of the data - the physical storage is the
in-memory copy

Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len

2020-02-23 Thread Dave Chinner
ch cases.
> 
> Do we expect an interface where if there are any bad blocks in the range
> being zeroed, then they all should be cleared (and hence all I/O should
> be aligned) otherwise error is returned. If yes, I could make that
> change.
> 
> Downside of current interface is that it will clear as many blocks as
> possible in the given range and leave starting and end blocks poisoned
> (if it is unaligned) and not return error. That means a reader will
> get error on these blocks again and they will have to try to clear it
> again.

Which is solved by having partial page writes always fail with EIO
on poisoned memory.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 0/7] xfs: reflink & dedupe for fsdax (read/write path).

2019-11-14 Thread Dave Chinner
On Thu, Nov 07, 2019 at 07:30:32PM -0800, Dan Williams wrote:
> On Thu, Nov 7, 2019 at 7:11 PM Shiyang Ruan  
> wrote:
> >
> > Hi Darrick, Dave,
> >
> > Do you have any comment on this?
> 
> Christoph pointed out at ALPSS that this problem has significant
> overlap with the shared page-cache for reflink problem. So I think we
> need to solve that first and then circle back to dax reflink support.
> I'm starting to take a look at that.

I think the DAX side is somewhat simpler because it doesn't really
need to involve the page cache and we don't have to worry about
subtly breaking random filesystems. Hence I'd suggest we sort out a
solution for DAX first, then worry about page cache stuff. The
shared page cache for reflink feature is not a show stopper -
multiple references for DAX is a show stopper. Let's deal with the
DAX problem first.

Cheers,

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Lease semantic proposal

2019-10-10 Thread Dave Chinner
On Tue, Oct 01, 2019 at 02:01:57PM -0700, Ira Weiny wrote:
> On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> > On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > > doesn't appear to be compatible with the semantics required by
> > > > existing users of layout leases.
> > > 
> > > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > > consistent
> > > with what is currently implemented.  Also, by exporting all this to user 
> > > space
> > > we can now write tests for it independent of the RDMA pinning.
> > 
> > The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> > layout changes to occur to the file while the layout lease is held.
> 
> This was not my understanding.

These are the remote procedure calls that the pNFS client uses to
map and/or allocate space in the file it has a lease on:

struct export_operations {

int (*map_blocks)(struct inode *inode, loff_t offset,
  u64 len, struct iomap *iomap,
  bool write, u32 *device_generation);
int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 int nr_iomaps, struct iattr *iattr);
};

.map_blocks() allows the pNFS client to allocate blocks in the
storage.  .commit_blocks() is called once the write is complete to
do things like unwritten extent conversion on extents that it
allocated. In the XFS implementation of these methods, they call
directly into the same XFS block mapping code that the
read/write/mmap IO paths call into.
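
For reference, XFS wires these up roughly like this (from memory;
other export_operations members omitted, see fs/xfs/xfs_pnfs.c for
the implementations):

    const struct export_operations xfs_export_operations = {
        .map_blocks     = xfs_fs_map_blocks,
        .commit_blocks  = xfs_fs_commit_blocks,
    };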

A typical pNFS use case is an HPC cluster, where thousands of nodes
might all be writing to separate parts of a huge sparse file (e.g.
out of core sparse matrix solver) and are reading/writing direct to
the storage via iSER or some other low level IB/RDMA storage
protocol.  Every write on every pNFS client needs space allocation,
so the pNFS server is basically just providing a remote interface to
the XFS space allocation interfaces for direct IO on the pNFS
clients.

IOWs, there can be thousands of concurrent pNFS layout leases on a
single inode at any given time and they can all be allocating space,
too.

> > IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> > to change the is in direct contradition to existing users.
> > 
> > I've said this several times over the past few months now: shared
> > layout leases must allow layout modifications to be made.
> 
> I don't understand what the point of having a layout lease is then?

It's a *notification* mechanism.

Multiple processes can modify the file layout at the same time -
XFS was designed as a multi-writer filesystem from the ground up and
we make use of that with shared IO locks for direct IO writes.

The read layout lease model we've used for pNFS is essentially the
same concurrent writer model that direct IO in XFS uses. And to
enable concurrent writers, we use shared locking for the layout
leases.

IOWs, the pNFS client IO model is very similar to local direct IO,
except for the fact they can remotely cache layout mappings.  Hence
if you do a server-side local buffered write (delayed allocation),
truncate, punch a hole, etc, (or a remote operation through the NFS
server that ends up in these same paths) the mappings those pNFS
clients hold are no longer guaranteed to cover valid data and/or
have correct physical mappings for valid data held on the server.

At this point, the layouts need to be invalidated, and so the layout
lease is broken by the filesystem operations that may cause an
issue. The pNFS server reacts to the lease break by recalling the
client layout(s) and the pNFS client has to request a new layout
from the server to be able to directly access the storage again.

i.e. the layout lease is primarily a *notification* mechanism to
allow safe interop between application level direct access
mechanisms and local filesystem access.

What you are trying to do is turn this multi-writer layout lease
notification mechanism into a single writer access control
mechanism. i.e. F_UNBREAK is all about /control/ of the layout and
who can and can't modify it, regardless of whether write
permissions have been granted or not.

It seems I have been unable to get this point across despite trying
for months now: access control is not a function provided by layout
leases. If you need to guarantee exclusive access to a file so
nobody else can modify it, direct access or through the filesystem,
then that is what permission bits, ACLs, file locks, LSMs, etc are
for. Don't try to overload a layout change notification mechanism
with data access controls.

> I apologize for not understan

Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.

2019-10-10 Thread Dave Chinner
On Wed, Oct 09, 2019 at 10:11:52AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 08, 2019 at 11:31:44PM -0700, Christoph Hellwig wrote:
> > Btw, I just had a chat with Dan last week on this.  And he pointed out
> > that while this series deals with the read/write path issues of 
> > reflink on DAX it doesn't deal with the mmap side issue that
> > page->mapping and page->index can point back to exactly one file.
> > 
> > I think we want a few xfstests that reflink a file and then use the
> > different links using mmap, as that should blow up pretty reliably.
> 
> Hmm, you're right, we don't actually have a test that checks the
> behavior of mwriting all copies of a shared block.  Ok, I'll go write
> one.

I've pointed this problem out to everyone who has asked me "what do
we need to do to support reflink on DAX". I've even walked a couple
of people right through the problem that needs to be solved and
discussed the potential solutions to it.

Problems that I think need addressing:

- device dax and filesystem dax have fundamentally different
  needs in this space, so they need to be separated and not
  try to use the same solution.
- dax_lock_entry() being used as a substitute for
  page_lock() but it not being held on the page itself means
  it can't be extended to serialise access to the page
  across multiple mappings that are unaware of each other
- dax_lock_page/dax_unlock_page interface for hardware
  memory errors needs to report the error to the
  filesystem for processing and repair, not assume the page
  is user data and that killing processes is the only
  possible recovery mechanism.
- dax_associate_entry/dax_disassociate_entry can only work
  for a 1:1 page:mapping,index relationship. It needs to go
  away and be replaced by a mechanism that allows
  tracking multiple page mapping/index/state tuples. This
  has much wider use than DAX (e.g. sharing page cache pages
  between reflinked files)

I've proposed shadow pages (based on a concept from Matthew Wilcox)
for each read-only reflink mapping, with the real physical page being
owned by the filesystem and indexed by LBA in the filesystem buffer
cache. This would be based on whether the extent in the file the
page is mapped from has multiple references to it.

i.e. when a new page mapping occurs in a shared extent, we add the
page to the buffer cache (i.e. point a struct xfs_buf at it) if it
isn't already present, then allocate a shadow page, point it at the
master, set it up with the new {mapping, index} tuple and add it to
the mapping tree. Then we can treat it as a unique page even though
it points to the read-only master page.

When the page gets COWed, we toss away the shadow page and the
master can be reclaimed when the reference count goes to zero or the
extent is no longer shared.  Think of it kind of like the way we
multiply reference the zero page for holes in mmap()d dax regions,
except we can have millions of them and they are found by physical
buffer cache index lookups.

This works for both DAX and non-DAX sharing of read-only shared
filesystem pages. i.e. it would form the basis of single-copy
read-only page cache pages for reflinked files.
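
Purely as an illustration of the data structure relationships -
nothing like this exists today and the names are invented:

    /*
     * One "shadow" stands in for the shared master page in a single
     * file's mapping tree. The master page holds the real data, is
     * owned by the filesystem buffer cache and is indexed by LBA.
     */
    struct example_shadow_page {
        struct page             *master;    /* fs-owned, LBA-indexed */
        struct address_space    *mapping;   /* this owner's mapping */
        pgoff_t                 index;      /* this owner's file offset */
        struct list_head        list;       /* other shadows of the master */
    };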

There was quite a bit of talk at LSFMM 2018 about having a linked
list of mapping structures hanging off a struct page, one for each
mapping that references the page. Operations would then have to walk
all mappings that reference the page. This was useful to other
subsystems (HMM?) for some purpose I forget, but I'm not sure it's
particularly useful by itself for non-dax reflink purposes - I
suspect the filesystem would still need to track such pages itself
in its buffer cache so it can find the cached page to link new
reflink copies to the same page...

ISTR a couple of other solutions were thrown around, but I don't
think anyone came up with a simple solution...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Lease semantic proposal

2019-09-23 Thread Dave Chinner
. Single writers are something we
are always trying to avoid in XFS.

> Operations which change the layout are allowed by that process.  But 
> operations
> from other file descriptors which attempt to change the layout will break the
> lease through the standard lease break process.

If the F_WRLCK|F_LAYOUT lease is exclusive, who is actually able to
modify the layout?  Are you talking about processes that don't
actually hold leases modifying the layout? i.e. what are the
constraints on "exclusive access" here - is F_WRLCK|F_LAYOUT
only exclusive when every modification is co-operating and taking
the appropriate layout lease for every access to the file that is
made?

If that's the case, what happens when someone fails to get a read
lock and decides "I can break write locks just by using ftruncate()
to the same size without a layout lease". Or fallocate() to
preallocate space that is already allocated. Or many other things I
can think of.

IOWs, this seems to me like a very fragile sort of construct that is
open to abuse and that will lead to everyone using F_UNBREAK, which
is highly unfriendly to everyone else...

> The F_LAYOUT flag is used to
> indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> the F_LAYOUT case opens for write do not break the lease.  But some 
> operations,
> if they change the underlying layout, may.
> 
> The distinction between read layout leases and write layout leases is that
> write layout leases can change the layout without breaking the lease within 
> the
> owning process.  This is useful to guarantee a layout prior to specifying the
> unbreakable flag described below.

Ok, so now you really are saying that F_RDLCK leases can only be
used on O_RDONLY file descriptors because any modification under a
F_RDLCK|F_LAYOUT lease will trigger a layout break.

> **Unbreakable Layout Leases (F_UNBREAK)**
> 
> In order to support pinning of file pages by direct user space users an
> unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> lease.  When specified, F_UNBREAK indicates that any user attempting to break
> the lease will fail with ETXTBUSY rather than follow the normal breaking
> procedure.
> 
> Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> specified.  The difference between an unbreakable read layout lease and an
> unbreakable write layout lease are that an unbreakable read layout lease is
> _not_ exclusive. 

Oh, this doesn't work at all. Now we get write()s to F_RDLCK leases
that can't break the leases and so all writes, even to processes
that own RDLCK|UNBREAK, will fail with ETXTBSY.

> This means that once a layout is established on a file,
> multiple unbreakable read layout leases can be taken by multiple processes and
> used to pin the underlying pages of that file.

Ok, so what happens when someone now takes a
F_WRLCK|F_LAYOUT|F_UNBREAK? Is that supposed to break
F_RDLCK|F_LAYOUT|F_UNBREAK, as the wording about F_WRLCK behaviour
implies it should?

> Care must therefore be taken to ensure that the layout of the file is as the
> user wants prior to using the unbreakable read layout lease.  A safe mechanism
> to do this would be to take a write layout lease and use fallocate() to set 
> the
> layout of the file.  The layout lease can then be "downgraded" to unbreakable
> read layout as long as no other user broke the write layout lease.

What are the semantics of this "downgrade" behaviour you speak of? :)

My thoughts are:
- RDLCK can only be used for O_RDONLY because write()
  requires breaking of leases
- WRLCK is open to abuse simply by not using a layout lease
  to do a "no change" layout modification
- RDLCK|F_UNBREAK is entirely unusable
- WRLCK|F_UNBREAK will be what every application uses
  because everything else either doesn't work or is too easy
  to abuse.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 01/19] dax: remove block device dependencies

2019-08-28 Thread Dave Chinner
On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > For bdev_dax_pgoff
> > > > I'd much rather have the partition offset if there is on in the daxdev
> > > > somehow so that we can get rid of the block device entirely.
> > > 
> > > IIUC, there is one block_device per partition while there is only one
> > > dax_device for the whole disk. So we can't directly move bdev logical
> > > offset into dax_device.
> > 
> > Well, then we need to find a way to get partitions for dax devices,
> > as we really should not expect a block device hiding behind a dax
> > dev.  That is just a weird legacy assumption - block device need to
> > layer on top of the dax device optionally.
> > 
> > > 
> > > We probably could put this in "iomap" and leave it to filesystems to
> > > report offset into dax_dev in iomap that way dax generic code does not
> > > have to deal with it. But that probably will be a bigger change.
> > 
> > And where would the file system get that information from?
> 
> File system knows about block device, can it just call get_start_sect()
> while filling iomap->addr. And this means we don't have to have
> parition information in dax device. Will something like following work?
> (Just a proof of concept patch).
> 
> 
> ---
>  drivers/dax/super.c |   11 +++
>  fs/dax.c            |    6 +++---
>  fs/ext4/inode.c     |    6 +-
>  include/linux/dax.h |    1 +
>  4 files changed, 20 insertions(+), 4 deletions(-)
> 
> Index: rhvgoyal-linux/fs/ext4/inode.c
> ===================================================================
> --- rhvgoyal-linux.orig/fs/ext4/inode.c       2019-08-28 13:51:16.051937204 -0400
> +++ rhvgoyal-linux/fs/ext4/inode.c    2019-08-28 13:51:44.453937204 -0400
> @@ -3589,7 +3589,11 @@ retry:
>   WARN_ON_ONCE(1);
>   return -EIO;
>   }
> - iomap->addr = (u64)map.m_pblk << blkbits;
> + if (IS_DAX(inode))
> + iomap->addr = ((u64)map.m_pblk << blkbits) +
> +   (get_start_sect(iomap->bdev) * 512);
> + else
> + iomap->addr = (u64)map.m_pblk << blkbits;

I'm not a fan of returning a physical device sector address from an
interface where every other user/caller expects this address to be a
logical block address into the block device. It creates a landmine
in the iomap API that callers may not be aware of and that's going
to cause bugs. We're trying really hard to keep special case hacks
like this out of the iomap infrastructure, so on those grounds alone
I'd suggest this is a dead end approach.

Hence I think that if the dax device needs a physical offset from
the start of the block device the filesystem sits on, it should be
set up at dax device instantiation time and so the filesystem/bdev
never needs to be queried again for this information.
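
i.e. something like this invented sketch, where the partition offset
is captured once when the dax device is instantiated and the block
device is never consulted again on the I/O path:

    struct example_daxdev_map {
        struct dax_device *dax_dev;
        sector_t part_offset;               /* in 512 byte sectors */
    };

    static pgoff_t example_dax_pgoff(struct example_daxdev_map *map,
                                     sector_t sector)
    {
        /* no get_start_sect()/bdev lookup needed at I/O time */
        return PHYS_PFN((sector + map->part_offset) << SECTOR_SHIFT);
    }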

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-25 Thread Dave Chinner
On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > > 
> > > > > But the fact that RDMA, and potentially others, can "pass the
> > > > > pins" to other processes is something I spent a lot of time trying to 
> > > > > work out.
> > > > 
> > > > There's nothing in file layout lease architecture that says you
> > > > can't "pass the pins" to another process.  All the file layout lease
> > > > requirements say is that if you are going to pass a resource for
> > > > which the layout lease guarantees access for to another process,
> > > > then the destination process already have a valid, active layout
> > > > lease that covers the range of the pins being passed to it via the
> > > > RDMA handle.
> > > 
> > > How would the kernel detect and enforce this? There are many ways to
> > > pass a FD.
> > 
> > AFAIC, that's not really a kernel problem. It's more of an
> > application design constraint than anything else. i.e. if the app
> > passes the IB context to another process without a lease, then the
> > original process is still responsible for recalling the lease and
> > has to tell that other process to release the IB handle and it's
> > resources.
> > 
> > > IMHO it is wrong to try and create a model where the file lease exists
> > > independently from the kernel object relying on it. In other words the
> > > IB MR object itself should hold a reference to the lease it relies
> > > upon to function properly.
> > 
> > That still doesn't work. Leases are not individually trackable or
> > reference counted objects objects - they are attached to a struct
> > file bUt, in reality, they are far more restricted than a struct
> > file.
> > 
> > That is, a lease specifically tracks the pid and the _open fd_ it
> > was obtained for, so it is essentially owned by a specific process
> > context.  Hence a lease is not able to be passed to a separate
> > process context and have it still work correctly for lease break
> > notifications.  i.e. the layout break signal gets delivered to
> > original process that created the struct file, if it still exists
> > and has the original fd still open. It does not get sent to the
> > process that currently holds a reference to the IB context.
> >
> 
> The fcntl man page says:
> 
> "Leases are associated with an open file description (see open(2)).  This 
> means
> that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> refer to the same lease, and this lease may be modified or released using any
> of these descriptors.  Furthermore,  the lease is released by either an
> explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> all such file descriptors have been closed."

Right, the lease is attached to the struct file, so it follows
wherever the struct file goes. That doesn't mean it's actually
useful when the struct file is duplicated and/or passed to another
process. :/

AFAICT, the problem is that when we take another reference to the
struct file, or when the struct file is passed to a different
process, nothing updates the lease or lease state attached to that
struct file.

> From this I took it that the child process FD would have the lease as well
> _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does 
> not
> seem to work the same way as dup() so I'm not so sure.

Sure, that part works because the struct file is passed. It doesn't
end up with the same fd number in the other process, though.

The issue is that layout leases need to notify userspace when they
are broken by the kernel, so a lease stores the owner pid/tid in the
file->f_owner field via __f_setown(). It also keeps a struct fasync
attached to the file_lock that records the fd that the lease was
created on.  When a signal needs to be sent to userspace for that
lease, we call kill_fasync() and that walks the list of fasync
structures on the lease and calls:

send_sigio(fown, fa->fa_fd, band);

And it does that for every fasync struct attached to a lease. Yes, a
lease can track multiple fds, but it can only track them in a single
process context. The moment the struct file is shared with another
process, the lease is no longer capable of sending notifications to
all the lease holders.
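
For comparison, this is what the single-owner model looks like from
userspace. F_LAYOUT is the proposed flag (not an existing API) and
the handler names are made up; the rest is standard fcntl lease
plumbing:

    signal(SIGIO, layout_break_handler);    /* user-supplied handler */
    fcntl(fd, F_SETOWN, getpid());          /* where SIGIO gets sent */
    if (fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT) < 0)  /* proposed API */
        handle_lease_error();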

Yes, you can change the owning process via F_SETOWN, but that's
still only a single process context, and

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-23 Thread Dave Chinner
On Fri, Aug 23, 2019 at 10:15:04AM -0700, Ira Weiny wrote:
> On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where 
> > > > > > > the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal 
> > > > > > > signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on 
> > > > > > close() for the
> > > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > > >uverbs_destroy_ufile_hw()
> > > > > >   mutex_lock()
> > > > > >[..]
> > > > > > mmput()
> > > > > >  exit_mmap()
> > > > > >   remove_vma()
> > > > > >fput();
> > > > > > file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully 
> > > solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) 
> > > is safe
> > > with regard to hanging the close for the "data file FD" (the file which 
> > > has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > > 
> > > Process A has the RDMA context FD and data file FD (with lease) open.
> > > 
> > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> > 
> > Passing the RDMA context dependent on a file layout lease to another
> > process that doesn't have a file layout lease or a reference to the
> > original lease should be considered a violation of the layout lease.
> > Process B does not have an active layout lease, and so by the rules
> > of layout leases, it is not allowed to pin the layout of the file.
> > 
> 
> I don't disagree with the semantics of this.  I just don't know how to enforce
> it.
> 
> > > Process A attempts to exit (hangs because data file FD is pinned).
> > > 
> > > Admin kills process A.  kill works because we have allowed for it...
> > > 
> > > Process B _still_ has the RDMA context FD open _and_ therefore still 
> > > holds the
> > > file pins.
> > > 
> > > Truncation still fails.
> > > 
> > > Admin does not know which process is holding the pin.
> > > 
> > > What am I missing?
> > 
> > Application does not hold the correct file layout lease references.
> > Passing the fd via SCM_RIGHTS to a process without a layout lease
> > is equivalent to not using layout leases in the first place.
> 
> Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS
> in this case?  I'm ok with that but not sure how to implement it right now.
> 
> To that end, I would like to simplify this slightly because I'm not convinced
> that SCM_RIGHTS is a problem we need to solve right now.  ie I don't know of a
> user who wants to do this.

I don't think we can support it, let alone want to. SCM_RIGHTS was a
mistake made years ago that has been causing bugs and complexity to
try and avoid those bugs ever since.  I'm only talking about it
because someone else raised it and I assumed they raised it because
they want it to "work".

> Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
> definition leases) exist underneath the "RDMA FD" (or other direct access FD,
> like XDP etc) being duplicated.

Sounds like a fine idea to me.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-23 Thread Dave Chinner
On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> 
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to 
> > > work out.
> > 
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process.  All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access for to another process,
> > then the destination process already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
> 
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.

AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and its
resources.

> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.

That still doesn't work. Leases are not individually trackable or
reference counted objects - they are attached to a struct
file but, in reality, they are far more restricted than a struct
file.

That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications.  i.e. the layout break signal gets delivered to the
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.

So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with
the new fd in the new process and so layout break signals can't be
directed at the lease fd...

This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that's where signals are sent when the lease needs to be broken.

So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-22 Thread Dave Chinner
m design perspective.

That's where the "nice and clean" semantics come from - starting
from "what can we actually support?", "what exactly do all the
different direct access mechanisms actually require?", "does it work
for future technologies not yet on our radar?" and working from
there.  So I'm not just looking at what we are doing right now, I'm
looking at 15 years down the track when we still have to support
layout leases and we've got hardware we haven't dreamed of yet.  If
the model is clean, simple, robust, implementation independent and
has well defined semantics, then it should stand the test of time.
i.e. the "nice and clean" semantics have nothign to do with the
filesystem per se, but everything to do with ensuring the mechanism
is generic and workable for direct storage access for a long time
into the future.

We can't force people to use layout leases - at all, let alone
correctly - but if you want filesystems and enterprise distros to
support direct access to filesystem controlled storage, then direct
access applications need to follow a sane set of rules that are
supportable at the filesystem level.

> But the fact that RDMA, and potentially others, can "pass the
> pins" to other processes is something I spent a lot of time trying to work 
> out.

There's nothing in the file layout lease architecture that says you
can't "pass the pins" to another process.  All the file layout lease
requirements say is that if you are going to pass a resource for
which the layout lease guarantees access to another process,
then the destination process must already have a valid, active
layout lease that covers the range of the pins being passed to it
via the RDMA handle.

i.e. as the pins pass from one process to another, they pass from
the protection of the lease that process A holds to the protection
of the lease that process B holds. This can probably even be done by
duplicating the lease fd and passing it by SCM_RIGHTS first.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-22 Thread Dave Chinner
On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() 
> > > > for the
> > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > >uverbs_destroy_ufile_hw()
> > > >   mutex_lock()
> > > >[..]
> > > > mmput()
> > > >  exit_mmap()
> > > >   remove_vma()
> > > >fput();
> > > > file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is 
> safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.
> 
> Process A has the RDMA context FD and data file FD (with lease) open.
> 
> Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Passing the RDMA context dependent on a file layout lease to another
process that doesn't have a file layout lease or a reference to the
original lease should be considered a violation of the layout lease.
Process B does not have an active layout lease, and so by the rules
of layout leases, it is not allowed to pin the layout of the file.

> Process A attempts to exit (hangs because data file FD is pinned).
> 
> Admin kills process A.  kill works because we have allowed for it...
> 
> Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> file pins.
> 
> Truncation still fails.
> 
> Admin does not know which process is holding the pin.
> 
> What am I missing?

Application does not hold the correct file layout lease references.
Passing the fd via SCM_RIGHTS to a process without a layout lease
is equivalent to not using layout leases in the first place.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> On 8/19/19 6:20 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > ...
> > > 
> > > Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> > > memory with FOLL_LONGTERM, and wondering how to make that work here.
> > 
> > I'm not sure how this interacts with file mappings? I mean, this
> > is just pinning anonymous pages for direct data placement into
> > userspace, right?
> > 
> > Are you asking "what if this pinned memory was a file mapping?",
> > or something else?
> 
> Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
> already there in xdp_umem_pin_pages(), unconditionally. So the
> simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
> not set) are not going to work here.
> 
> 
> > 
> > > These are close to files, in how they're handled, but just different
> > > enough that it's not clear to me how to make work with this system.
> > 
> > I'm guessing that if they are pinning a file backed mapping, they
> > are trying to dma direct to the file (zero copy into page cache?)
> > and so they'll need to either play by ODP rules or take layout
> > leases, too
> > 
> 
> OK. I was just wondering if there was some simple way to dig up a
> struct file associated with a socket (I don't think so), but it sounds
> like this is an exercise that's potentially different for each subsystem.

AFAIA, there is no struct file here - the memory that has been pinned
is just something mapped into the application's address space.

It seems to me that the socket here is the equivalent of the RDMA
handle that owns the hardware that pins the pages. Again, that RDMA
handle is not aware of what the mapping represents, hence it needs
to hold a layout lease if it's a file mapping.

So from the filesystem perspective, there's no difference between
XDP and RDMA - if it's a FSDAX mapping then it is DMAing directly
into the filesystem's backing store and that will require use of
layout leases to perform safely.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> On 8/19/19 2:24 AM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> ...
> > The last close is an interesting case because the __fput() call
> > actually runs from task_work() context, not where the last reference
> > is actually dropped. So it already has certain specific interactions
> > with signals and task exit processing via task_add_work() and
> > task_work_run().
> > 
> > task_add_work() calls set_notify_resume(task), so if nothing else
> > triggers when returning to userspace we run this path:
> > 
> > exit_to_usermode_loop()
> >tracehook_notify_resume()
> >  task_work_run()
> >__fput()
> > locks_remove_file()
> >   locks_remove_lease()
> > 
> > 
> > It's worth noting that locks_remove_lease() does a
> > percpu_down_read() which means we can already block in this context
> > removing leases
> > 
> > If there is a signal pending, the task work is run this way (before
> > the above notify path):
> > 
> > exit_to_usermode_loop()
> >do_signal()
> >  get_signal()
> >task_work_run()
> >  __fput()
> > 
> > We can detect this case via signal_pending() and even SIGKILL via
> > fatal_signal_pending(), and so we can decide not to block based on
> > the fact the process is about to be reaped and so the lease largely
> > doesn't matter anymore. I'd argue that it is close and we can't
> > easily back out, so we'd only break the block on a fatal signal
> > 
> > And then, of course, is the call path through do_exit(), which has
> > the PF_EXITING task flag set:
> > 
> > do_exit()
> >exit_task_work()
> >  task_work_run()
> >__fput()
> > 
> > and so it's easy to avoid blocking in this case, too.
> 
> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> memory with FOLL_LONGTERM, and wondering how to make that work here.

I'm not sure how this interacts with file mappings? I mean, this
is just pinning anonymous pages for direct data placement into
userspace, right?

Are you asking "what if this pinned memory was a file mapping?",
or something else?

> These are close to files, in how they're handled, but just different
> enough that it's not clear to me how to make work with this system.

I'm guessing that if they are pinning a file backed mapping, they
are trying to dma direct to the file (zero copy into page cache?)
and so they'll need to either play by ODP rules or take layout
leases, too

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destroyed. This is done to avoid a
> deadlock of the form:
> 
>uverbs_destroy_ufile_hw()
>   mutex_lock()
>[..]
> mmput()
>  exit_mmap()
>   remove_vma()
>fput();
> file_operations->release()

I think this is wrong, and I'm pretty sure it's an example of why
the final __fput() call is moved out of line.

fput()
  fput_many()
task_add_work(f, __fput())

and the call chain ends there.

Before the syscall returns to userspace, it then runs the __fput()
call through the task_work_run() interfaces, and hence the call
chain is just:

task_work_run
  __fput
> file_operations->release()
>  ib_uverbs_close()
>   uverbs_destroy_ufile_hw()
>mutex_lock()   <-- Deadlock

And there is no deadlock because nothing holds the mutex at this
point.
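
For reference, the deferral looks roughly like this - a simplified
sketch of the pattern, not the exact fs/file_table.c code (the
callback_head really lives inside struct file, and the task_work_add()
notify argument has changed form over kernel versions):

        /* sketch: the callback_head really lives in struct file */
        static struct callback_head fput_deferred_work;

        void fput(struct file *file)
        {
                if (atomic_long_dec_and_test(&file->f_count)) {
                        /* defer the real teardown: __fput() runs later
                         * from task_work_run() on return to userspace,
                         * so none of the locks held at this call site
                         * are held when ->release() is finally invoked */
                        init_task_work(&fput_deferred_work, ____fput);
                        task_work_add(current, &fput_deferred_work, true);
                }
        }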

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > 2) Second reason is that I thought I did not have a good way to tell if 
> > > the
> > >lease was actually in use.  What I mean is that letting the lease go 
> > > should
> > >be ok IFF we don't have any pins...  I was thinking that without 
> > > John's code
> > >we don't have a way to know if there are any pins...  But that is 
> > > wrong...
> > >All we have to do is check
> > > 
> > >   !list_empty(file->file_pins)
> > > 
> > > So now with this detail I think you are right, we should be able to hold 
> > > the
> > > lease through the struct file even if the process no longer has any
> > > "references" to it (ie closes and munmaps the file).
> > 
> > I really, really dislike the idea of zombie layout leases. It's a
> > nasty hack for poor application behaviour. This is a "we allow use
> > after layout lease release" API, and I think encoding largely
> > untraceable zombie objects into an API is very poor design.
> > 
> > From the fcntl man page:
> > 
> > LEASES
> > Leases are associated with an open file description (see
> > open(2)).  This means that duplicate file descriptors
> > (created by, for example, fork(2) or dup(2)) refer to
> > the  same  lease,  and this lease may be modified or
> > released using any of these descriptors.  Furthermore, the
> > lease is released by either an explicit F_UNLCK operation on
> > any of these duplicate file descriptors, or when all such
> > file descriptors have been closed.
> > 
> > Leases are associated with *open* file descriptors, not the
> > lifetime of the struct file in the kernel. If the application closes
> > the open fds that refer to the lease, then the kernel does not
> > guarantee, and the application has no right to expect, that the
> > lease remains active in any way once the application closes all
> > direct references to the lease.
> > 
> > IOWs, applications using layout leases need to hold the lease fd
> > open for as long as they want access to the physical file layout. It
> > is also a requirement of the layout lease that the holder releases
> > the resources it holds on the layout before it releases the layout
> > lease, exclusive lease or not. Closing the fd indicates they do not
> > need access to the file any more, and so the lease should be
> > reclaimed at that point.
> > 
> > I'm of a mind to make the last close() on a file block if there's an
> > active layout lease to prevent processes from zombie-ing layout
> > leases like this. i.e. you can't close the fd until resources that
> > pin the lease have been released.
> 
> Yeah, so this was my initial though as well [1]. But as the discussion in
> that thread revealed, the problem with blocking last close is that kernel
> does not really expect close to block. You could easily deadlock e.g. if
> the process gets SIGKILL, file with lease has fd 10, and the RDMA context
> holding pages pinned has fd 15.

Sure, I did think about this a bit about it before suggesting it :)

The last close is an interesting case because the __fput() call
actually runs from task_work() context, not where the last reference
is actually dropped. So it already has certain specific interactions
with signals and task exit processing via task_add_work() and
task_work_run().

task_add_work() calls set_notify_resume(task), so if nothing else
triggers when returning to userspace we run this path:

exit_to_usermode_loop()
  tracehook_notify_resume()
task_work_run()
  __fput()
locks_remove_file()
  locks_remove_lease()


It's worth noting that locks_remove_lease() does a
percpu_down_read() which means we can already block in this context
removing leases

If there is a signal pending, the task work is run this way (before
the above notify path):

exit_to_usermode_loop()
  do_signal()
get_signal()
  task_work_run()
__fput()

We can detect this case via signal_pending() and even SIGKILL via
fatal_signal_pending(), and so we can decide not to block based on
the fact the process is about to be reaped and so the lease largely
doesn't matter anymore. I'd argue that it is close and we can't
easily back out, so we'd only break the block on a fatal signal.

And then, of course, is the call path through do_exit(), which has
the PF_EXITING task flag set:

do_exit()
  exit_task_work()
    task_work_run()
      __fput()

and so it's easy to avoid blocking in this case, too.

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-16 Thread Dave Chinner
On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> 2) Second reason is that I thought I did not have a good way to tell if the
>lease was actually in use.  What I mean is that letting the lease go should
>be ok IFF we don't have any pins...  I was thinking that without John's 
> code
>we don't have a way to know if there are any pins...  But that is wrong...
>All we have to do is check
> 
>   !list_empty(file->file_pins)
> 
> So now with this detail I think you are right, we should be able to hold the
> lease through the struct file even if the process no longer has any
> "references" to it (ie closes and munmaps the file).

I really, really dislike the idea of zombie layout leases. It's a
nasty hack for poor application behaviour. This is a "we allow use
after layout lease release" API, and I think encoding largely
untraceable zombie objects into an API is very poor design.

From the fcntl man page:

LEASES
Leases are associated with an open file description (see
open(2)).  This means that duplicate file descriptors
(created by, for example, fork(2) or dup(2)) refer to
the  same  lease,  and this lease may be modified or
released using any of these descriptors.  Furthermore, the
lease is released by either an explicit F_UNLCK operation on
any of these duplicate file descriptors, or when all such
file descriptors have been closed.

Leases are associated with *open* file descriptors, not the
lifetime of the struct file in the kernel. If the application closes
the open fds that refer to the lease, then the kernel does not
guarantee, and the application has no right to expect, that the
lease remains active in any way once the application closes all
direct references to the lease.

IOWs, applications using layout leases need to hold the lease fd
open for as long as they want access to the physical file layout. It
is also a requirement of the layout lease that the holder releases
the resources it holds on the layout before it releases the layout
lease, exclusive lease or not. Closing the fd indicates they do not
need access to the file any more, and so the lease should be
reclaimed at that point.

I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released.
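
i.e. the order of operations the lease model expects from the
application at teardown is (a sketch only; F_LAYOUT is the proposed
constant, and ibv_dereg_mr() just stands in for whatever releases
the pins):

        /* release whatever is pinning the pages first ... */
        ibv_dereg_mr(mr);

        /* ... then drop the layout lease ... */
        fcntl(fd, F_SETLEASE, F_UNLCK);

        /* ... and only then drop the last reference to the file */
        close(fd);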

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease

2019-08-14 Thread Dave Chinner
On Wed, Aug 14, 2019 at 10:15:06AM -0400, Jeff Layton wrote:
> On Fri, 2019-08-09 at 15:58 -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > Add an exclusive lease flag which indicates that the layout mechanism
> > can not be broken.
> > 
> > Exclusive layout leases allow the file system to know that pages may be
> > GUP pined and that attempts to change the layout, ie truncate, should be
> > failed.
> > 
> > A process which attempts to break it's own exclusive lease gets an
> > EDEADLOCK return to help determine that this is likely a programming bug
> > vs someone else holding a resource.
.
> > diff --git a/include/uapi/asm-generic/fcntl.h 
> > b/include/uapi/asm-generic/fcntl.h
> > index baddd54f3031..88b175ceccbc 100644
> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -176,6 +176,8 @@ struct f_owner_ex {
> >  
> >  #define F_LAYOUT   16  /* layout lease to allow longterm pins such as
> >RDMA */
> > +#define F_EXCLUSIVE32  /* layout lease is exclusive */
> > +   /* FIXME or shoudl this be F_EXLCK??? */
> >  
> >  /* operations for bsd flock(), also used by the kernel implementation */
> >  #define LOCK_SH1   /* shared lock */
> 
> This interface just seems weird to me. The existing F_*LCK values aren't
> really set up to be flags, but are enumerated values (even if there are
> some gaps on some arches). For instance, on parisc and sparc:

I don't think we need to worry about this - the F_WRLCK version of
the layout lease should have these exclusive access semantics (i.e
other ops fail rather than block waiting for lease recall) and hence
the API shouldn't need a new flag to specify them.

i.e. the primary difference between F_RDLCK and F_WRLCK layout
leases is that the F_RDLCK is a shared, co-operative lease model
where only delays in operations will be seen, while F_WRLCK is a
"guarantee exclusive access and I don't care what it breaks"
model... :)
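
i.e. from userspace the two behaviours would simply be requested as
(sketch, assuming the proposed F_LAYOUT flag):

        /* co-operative: get notified of layout changes, release and
         * re-acquire when the lease is recalled */
        fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT);

        /* exclusive: layout-modifying operations from other processes
         * fail instead of recalling the lease */
        fcntl(fd, F_SETLEASE, F_WRLCK | F_LAYOUT);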

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space

2019-08-14 Thread Dave Chinner
On Wed, Aug 14, 2019 at 07:21:34AM -0400, Jeff Layton wrote:
> On Wed, 2019-08-14 at 18:05 +1000, Dave Chinner wrote:
> > On Mon, Aug 12, 2019 at 10:36:26AM -0700, Ira Weiny wrote:
> > > On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.we...@intel.com wrote:
> > > > > + /*
> > > > > +  * NOTE on F_LAYOUT lease
> > > > > +  *
> > > > > +  * LAYOUT lease types are taken on files which the user knows 
> > > > > that
> > > > > +  * they will be pinning in memory for some indeterminate amount 
> > > > > of
> > > > > +  * time.
> > > > 
> > > > Indeed, layout leases have nothing to do with pinning of memory.
> > > 
> > > Yep, Fair enough.  I'll rework the comment.
> > > 
> > > > That's something an application that uses layout leases might do,
> > > > but it is largely irrelevant to the functionality layout leases
> > > > provide. What needs to be done here is explain what the layout lease
> > > > API actually guarantees w.r.t. the physical file layout, not what
> > > > some application is going to do with a lease. e.g.
> > > > 
> > > > The layout lease F_RDLCK guarantees that the holder will be
> > > > notified that the physical file layout is about to be
> > > > changed, and that it needs to release any resources it has
> > > > over the range of this lease, drop the lease and then
> > > > request it again to wait for the kernel to finish whatever
> > > > it is doing on that range.
> > > > 
> > > > The layout lease F_RDLCK also allows the holder to modify
> > > > the physical layout of the file. If an operation from the
> > > > lease holder occurs that would modify the layout, that lease
> > > > holder does not get notification that a change will occur,
> > > > but it will block until all other F_RDLCK leases have been
> > > > released by their holders before going ahead.
> > > > 
> > > > If there is a F_WRLCK lease held on the file, then a F_RDLCK
> > > > holder will fail any operation that may modify the physical
> > > > layout of the file. F_WRLCK provides exclusive physical
> > > > modification access to the holder, guaranteeing nothing else
> > > > will change the layout of the file while it holds the lease.
> > > > 
> > > > The F_WRLCK holder can change the physical layout of the
> > > > file if it so desires, this will block while F_RDLCK holders
> > > > are notified and release their leases before the
> > > > modification will take place.
> > > > 
> > > > We need to define the semantics we expose to userspace first.
> 
> Absolutely.
> 
> > > 
> > > Agreed.  I believe I have implemented the semantics you describe above.  
> > > Do I
> > > have your permission to use your verbiage as part of reworking the 
> > > comment and
> > > commit message?
> > 
> > Of course. :)
> > 
> > Cheers,
> > 
> 
> I'll review this in more detail soon, but subsequent postings of the set
> should probably also go to linux-api mailing list. This is a significant
> API change. It might not also hurt to get the glibc folks involved here
> too since you'll probably want to add the constants to the headers there
> as well.

Sure, but let's first get it to the point where we have something
that is actually workable, much more complete and somewhat validated
with unit tests before we start involving too many people. Wide
review of prototype code isn't really a good use of resources given
how much it's probably going to change from here...

> Finally, consider going ahead and drafting a patch to the fcntl(2)
> manpage if you think you have the API mostly nailed down. This API is a
> little counterintuitive (i.e. you can change the layout with an F_RDLCK
> lease), so it will need to be very clearly documented. I've also found
> that when creating a new API, documenting it tends to help highlight its
> warts and areas where the behavior is not clearly defined.

I find writing unit tests for xfstests to validate the new APIs work
as intended finds far more problems with the API than writing the
documentation. :)
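
e.g. the guts of such a test helper is only a few lines (a sketch
only - F_LAYOUT is the proposed constant, and ETXTBSY is the error
discussed elsewhere in this thread, whatever the final API settles
on):

        /* process A: take an exclusive layout lease on the test file */
        int fd = open(path, O_RDWR);
        if (fcntl(fd, F_SETLEASE, F_WRLCK | F_LAYOUT) < 0)
                err(1, "F_SETLEASE");

        /* process B: a layout-modifying operation should now fail
         * rather than block waiting for a lease recall */
        int fd2 = open(path, O_RDWR);
        if (ftruncate(fd2, 0) == 0 || errno != ETXTBSY)
                fprintf(stderr, "FAIL: truncate was not rejected\n");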

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space

2019-08-14 Thread Dave Chinner
On Mon, Aug 12, 2019 at 10:36:26AM -0700, Ira Weiny wrote:
> On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> > On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.we...@intel.com wrote:
> > > + /*
> > > +  * NOTE on F_LAYOUT lease
> > > +  *
> > > +  * LAYOUT lease types are taken on files which the user knows that
> > > +  * they will be pinning in memory for some indeterminate amount of
> > > +  * time.
> > 
> > Indeed, layout leases have nothing to do with pinning of memory.
> 
> Yep, Fair enough.  I'll rework the comment.
> 
> > That's something an application that uses layout leases might do,
> > but it is largely irrelevant to the functionality layout leases
> > provide. What needs to be done here is explain what the layout lease
> > API actually guarantees w.r.t. the physical file layout, not what
> > some application is going to do with a lease. e.g.
> > 
> > The layout lease F_RDLCK guarantees that the holder will be
> > notified that the physical file layout is about to be
> > changed, and that it needs to release any resources it has
> > over the range of this lease, drop the lease and then
> > request it again to wait for the kernel to finish whatever
> > it is doing on that range.
> > 
> > The layout lease F_RDLCK also allows the holder to modify
> > the physical layout of the file. If an operation from the
> > lease holder occurs that would modify the layout, that lease
> > holder does not get notification that a change will occur,
> > but it will block until all other F_RDLCK leases have been
> > released by their holders before going ahead.
> > 
> > If there is a F_WRLCK lease held on the file, then a F_RDLCK
> > holder will fail any operation that may modify the physical
> > layout of the file. F_WRLCK provides exclusive physical
> > modification access to the holder, guaranteeing nothing else
> > will change the layout of the file while it holds the lease.
> > 
> > The F_WRLCK holder can change the physical layout of the
> > file if it so desires, this will block while F_RDLCK holders
> > are notified and release their leases before the
> > modification will take place.
> > 
> > We need to define the semantics we expose to userspace first.
> 
> Agreed.  I believe I have implemented the semantics you describe above.  Do I
> have your permission to use your verbiage as part of reworking the comment and
> commit message?

Of course. :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page()

2019-08-14 Thread Dave Chinner
On Mon, Aug 12, 2019 at 11:05:51AM -0700, Ira Weiny wrote:
> On Sat, Aug 10, 2019 at 09:30:37AM +1000, Dave Chinner wrote:
> > On Fri, Aug 09, 2019 at 03:58:21PM -0700, ira.we...@intel.com wrote:
> > > From: Ira Weiny 
> > > 
> > > dax_layout_busy_page() can now operate on a sub-range of the
> > > address_space provided.
> > > 
> > > Have xfs specify the sub range to dax_layout_busy_page()
> > 
> > Hmmm. I've got patches that change all these XFS interfaces to
> > support range locks. I'm not sure the way the ranges are passed here
> > is the best way to do it, and I suspect they aren't correct in some
> > cases, either
> > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index ff3c1fae5357..f0de5486f6c1 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -1042,10 +1042,16 @@ xfs_vn_setattr(
> > >   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > >   iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > >  
> > > - error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
> > > - if (error) {
> > > - xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > - return error;
> > > + if (iattr->ia_size < inode->i_size) {
> > > + loff_t  off = iattr->ia_size;
> > > + loff_t  len = inode->i_size - 
> > > iattr->ia_size;
> > > +
> > > + error = xfs_break_layouts(inode, &iolock, off, len,
> > > +   BREAK_UNMAP);
> > > + if (error) {
> > > + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > + return error;
> > > + }
> > 
> > This isn't right - truncate up still needs to break the layout on
> > the last filesystem block of the file,
> 
> I'm not following this?  From a user perspective they can't have done anything
> with the data beyond the EOF.  So isn't it safe to allow EOF to grow without
> changing the layout of that last block?


You're looking at this from the perspective of RDMA page pinning,
not from the guarantees a filesystem has to provide to layout
holders.

For example, truncate up has to zero the portion of the block beyond
EOF and that requires a data write. What happens if that block is a
shared extent and hence we have to do a copy on write and alter the
file layout?

Or perhaps that tail block still has dirty data over it that is
marked for delayed allocation? Truncate up will have to write that
data to zero the delayed allocation extent that spans EOF, and hence
the truncate modifies the layout because it triggers allocation.

i.e. just because an operation does not change user data, it does
not mean that it will not change the file layout. There is a chance
that truncate up will modify the layout and so we need to break the
layout leases that span the range from the old size to the new
size...
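
i.e. the range passed to xfs_break_layouts() for a size change needs
to cover everything the size change might touch in both directions -
something along these lines (a sketch against the extended prototype
in this patch, not a tested change; LLONG_MAX is just a stand-in for
"maximum file offset"):

        loff_t  off = min_t(loff_t, iattr->ia_size, inode->i_size);
        loff_t  len = LLONG_MAX - off;

        /* covers the tail block for truncate up and everything
         * beyond the new EOF for truncate down */
        error = xfs_break_layouts(inode, &iolock, off, len, BREAK_UNMAP);
        if (error) {
                xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
                return error;
        }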

> > and truncate down needs to
> > extend to "maximum file offset" because we remove all extents beyond
> > EOF on a truncate down.
> 
> Ok, I was trying to allow a user to extend the file without conflicts if they
> were to have a pin on the 'beginning' of the original file.

If we want to allow file extension under a layout lease, the lease
has to extend beyond EOF, otherwise the new section of the file is
not covered by a lease. If leases only extend to the existing
EOF, then once the new data is written and the file is extended,
then the lease owner needs to take a new lease on the range they
just wrote. SO the application ends up having to do write - lease
-write -lease -  so that it has leases covering the range of the
file it is extending into.

Much better is to define a lease that extends to the maximum file
offset, such that it always covers the range past the existing EOF
and extending writes will automatically be covered. What this then
does is trigger layout break notifications on file size change,
whether by write, truncate or fallocate, without having to actually
know or track the exact file size in the lease...

> This sounds like
> you are saying that a layout lease must be dropped to do that?  In some ways I
> think I understand what you are driving at and I think I see how I may have
> been playing "fast and loose" with the strictness of the layout lease.  But
> from a user perspective if there is a part of the file which "does not exist"
> (beyond EOF) does it matter that the layout there may change?

Yes, it does, because userspace can directly manipulate the layout
beyond EOF via fallocate(). e.

Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space

2019-08-09 Thread Dave Chinner
On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> In order to support an opt-in policy for users to allow long term pins
> of FS DAX pages we need to export the LAYOUT lease to user space.
> 
> This is the first of 2 new lease flags which must be used to allow a
> long term pin to be made on a file.
> 
> After the complete series:
> 
> 0) Registrations to Device DAX char devs are not affected
> 
> 1) The user has to opt in to allowing page pins on a file with an exclusive
>layout lease.  Both exclusive and layout lease flags are user visible now.
> 
> 2) page pins will fail if the lease is not active when the file back page is
>encountered.
> 
> 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> 
> 4) The user has the option of holding the lease or releasing it.  If they
>release it no other pin calls will work on the file.
> 
> 5) Closing the file is ok.
> 
> 6) Unmapping the file is ok
> 
> 7) Pins against the files are tracked back to an owning file or an owning mm
>depending on the internal subsystem needs.  With RDMA there is an owning
>file which is related to the pined file.
> 
> 8) Only RDMA is currently supported
> 
> 9) Truncation of pages which are not actively pinned nor covered by a lease
>will succeed.

This has nothing to do with layout leases or what they provide
access arbitration over. Layout leases have _nothing_ to do with
page pinning or RDMA - they arbitrate behaviour the file offset ->
physical block device mapping within the filesystem and the
behaviour that will occur when a specific lease is held.

The commit description needs to describe what F_LAYOUT actually
protects, when leases will get broken, etc., not how RDMA is going
to use it.

> @@ -2022,8 +2030,26 @@ static int do_fcntl_add_lease(unsigned int fd, struct 
> file *filp, long arg)
>   struct file_lock *fl;
>   struct fasync_struct *new;
>   int error;
> + unsigned int flags = 0;
> +
> + /*
> +  * NOTE on F_LAYOUT lease
> +  *
> +  * LAYOUT lease types are taken on files which the user knows that
> +  * they will be pinning in memory for some indeterminate amount of
> +  * time.

Indeed, layout leases have nothing to do with pinning of memory.
That's something an application that uses layout leases might do,
but it is largely irrelevant to the functionality layout leases
provide. What needs to be done here is explain what the layout lease
API actually guarantees w.r.t. the physical file layout, not what
some application is going to do with a lease. e.g.

The layout lease F_RDLCK guarantees that the holder will be
notified that the physical file layout is about to be
changed, and that it needs to release any resources it has
over the range of this lease, drop the lease and then
request it again to wait for the kernel to finish whatever
it is doing on that range.

The layout lease F_RDLCK also allows the holder to modify
the physical layout of the file. If an operation from the
lease holder occurs that would modify the layout, that lease
holder does not get notification that a change will occur,
but it will block until all other F_RDLCK leases have been
released by their holders before going ahead.

If there is a F_WRLCK lease held on the file, then a F_RDLCK
holder will fail any operation that may modify the physical
layout of the file. F_WRLCK provides exclusive physical
modification access to the holder, guaranteeing nothing else
will change the layout of the file while it holds the lease.

The F_WRLCK holder can change the physical layout of the
file if it so desires, this will block while F_RDLCK holders
are notified and release their leases before the
modification will take place.

We need to define the semantics we expose to userspace first.
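
For the F_RDLCK case, the "will be notified" part can use the
existing lease notification machinery - this is how normal F_SETLEASE
leases already deliver a break (whether layout leases end up reusing
it exactly like this is an open question):

        /* ask for a real time signal rather than the default SIGIO
         * so the handler can tell which fd's lease is being broken */
        fcntl(fd, F_SETSIG, SIGRTMIN);
        fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT);

        /* in the signal handler: release resources over the range,
         * then drop the lease so the blocked operation can proceed */
        fcntl(fd, F_SETLEASE, F_UNLCK);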

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken

2019-08-09 Thread Dave Chinner
On Fri, Aug 09, 2019 at 03:58:22PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> If pages are under a lease fail the truncate operation.  We change the order 
> of
> lease breaks to directly fail the operation if the lease exists.
> 
> Select EXPORT_BLOCK_OPS for FS_DAX to ensure that xfs_break_lease_layouts() is
> defined for FS_DAX as well as pNFS.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/Kconfig| 1 +
>  fs/xfs/xfs_file.c | 5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 14cd4abdc143..c10b91f92528 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -48,6 +48,7 @@ config FS_DAX
>   select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
>   select FS_IOMAP
>   select DAX
> + select EXPORTFS_BLOCK_OPS
>   help
> Direct Access (DAX) can be used on memory-backed block devices.
> If the block device supports DAX and the filesystem supports DAX,

That looks wrong. If you require xfs_break_lease_layouts() outside
of pnfs context, then move the function in the XFS code base to a
file that is built in. Its only external dependency is on the
break_layout() function, and XFS already has other unconditional
direct calls to break_layout()...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.

2019-08-04 Thread Dave Chinner
On Thu, Aug 01, 2019 at 09:37:04AM +0800, Shiyang Ruan wrote:
> 
> 
> On 8/1/19 4:33 AM, Goldwyn Rodrigues wrote:
> > On 19:49 31/07, Shiyang Ruan wrote:
> > > This patchset aims to take care of this issue to make reflink and dedupe
> > > work correctly in XFS.
> > > 
> > > It is based on Goldwyn's patchsets: "v4 Btrfs dax support" and "Btrfs
> > > iomap".  I picked up some patches related and made a few fix to make it
> > > basically works fine.
> > > 
> > > For dax framework:
> > >1. adapt to the latest change in iomap.
> > > 
> > > For XFS:
> > >1. report the source address and set IOMAP_COW type for those write
> > >   operations that need COW.
> > >2. update extent list at the end.
> > >3. add file contents comparison function based on dax framework.
> > >4. use xfs_break_layouts() to support dax.
> > 
> > Shiyang,
> > 
> > I think you used the older patches which does not contain the iomap changes
> > which would call iomap_begin() with two iomaps. I have it in the btrfs-iomap
> 
> Oh, Sorry for my carelessness.  This patchset is built on your "Btrfs
> iomap".  I didn't point it out in cover letter.
> 
> > branch and plan to update it today. It is built on v5.3-rcX, so it should
> > contain the changes which moves the iomap code to the different directory.
> > I will build the dax patches on top of that.
> > However, we are making a big dependency chain here
> Don't worry.  It's fine for me.  I'll follow your updates.

Hi Shiyang,

I'll wait for you to update your patches on top of the latest btrfs
patches before looking at this any further. It would be good to get
a set of iomap infrastructure patches separated from the btrfs
patchsets so we could have them both built from a common patchset.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: xfs quota test xfs/050 fails with dax mount option and "-d su=2m,sw=1" mkfs option

2019-07-28 Thread Dave Chinner
On Wed, Jul 24, 2019 at 05:43:17PM +0800, Murphy Zhou wrote:
> Hi,
> 
> As subject.
> 
> -d su=2m,sw=1 && -o dax  fail
> -d su=2m,sw=1 && NO dax  pass
> no su mkfs option && -o dax  pass
> no su mkfs option && NO dax  pass
> 
> On latest Linus tree. Reproduce every time.
> 
> Testing on older kernels are going on to see if it's a regression.
> 
> Is this failure expected ?

I'm not sure it's actually a failure at all. DAX does not do delayed
allocation, so if the write is aligned to sunit and at EOF it will
round the allocation up to a full stripe unit. IOWs, for this test
once the file size gets beyond sunit on DAX, writes will allocate in
sunit chunks.

And, well, xfs/050 has checks in it for extent size hints, and
notruns if:

[ $extsize -ge 512000 ] && \
_notrun "Extent size hint is too large ($extsize bytes)"

Because EDQUOT occurs when:

> + URK 99: 2097152 is out of range! [3481600,4096000]

the file has less than 3.5MB or > 4MB allocated to it, and for a
stripe unit of > 512k, EDQUOT will occur at  <3.5MB. That's what
we are seeing here - a 2MB allocation at offset 2MB is > 4096000
bytes, and so it gets EDQUOT at that point
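
Roughly:

        allocation granularity with su=2m:   2097152 bytes
        possible allocated sizes:            2097152, 4194304, ...
        range the test accepts:              [3481600, 4096000]

Neither 2097152 (2MB) nor 4194304 (4MB) falls inside that window, so
once allocations happen in 2MB chunks the quota check in the test can
only fail.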

IOWs, this looks like a test problem, not a code failure...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dave Chinner
On Thu, Jun 13, 2019 at 01:34:06PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > > Are you suggesting that we have something like this from user space?
> > > > > > 
> > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > > 
> > > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > > policy it entails is "exclusive"?
> > > > > 
> > > > > i.e. what we are talking about here is an exclusive lease that
> > > > > prevents other processes from changing the layout. i.e. the
> > > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > > actually presenting to users is "exclusive access"...
> > > > 
> > > > That's rather different from the normal meaning of 'exclusive' in the
> > > > context of locks, which is "only one user can have access to this at
> > > > a time".  As I understand it, this is rather more like a 'shared' or
> > > > 'read' lock.  The filesystem would be the one which wants an exclusive
> > > > lock, so it can modify the mapping of logical to physical blocks.
> > > > 
> > > > The complication being that by default the filesystem has an exclusive
> > > > lock on the mapping, and what we're trying to add is the ability for
> > > > readers to ask the filesystem to give up its exclusive lock.
> > > 
> > > This is an interesting view...
> > > 
> > > And after some more thought, exclusive does not seem like a good name for 
> > > this
> > > because technically F_WRLCK _is_ an exclusive lease...
> > > 
> > > In addition, the user does not need to take the "exclusive" write lease 
> > > to be
> > > notified of (broken by) an unexpected truncate.  A "read" lease is broken 
> > > by
> > > truncate.  (And "write" leases really don't do anything different WRT the
> > > interaction of the FS and the user app.  Write leases control "exclusive"
> > > access between other file descriptors.)
> > 
> > I've been assuming that there is only one type of layout lease -
> > there is no use case I've heard of for read/write layout leases, and
> > like you say there is zero difference in behaviour at the filesystem
> > level - they all have to be broken to allow a non-lease truncate to
> > proceed.
> > 
> > IMO, taking a "read lease" to be able to modify and write to the
> > underlying mapping of a file makes absolutely no sense at all.
> > IOWs, we're talking exactly about a revocable layout lease vs an
> > exclusive layout lease here, and so read/write really doesn't match
> > the policy or semantics we are trying to provide.
> 
> I humbly disagree, at least depending on how you look at it...  :-D
> 
> The patches as they stand expect the user to take a "read" layout lease which
> indicates they are currently using "reading" the layout as is.
> They are not
> changing ("writing" to) the layout.

As I said in a another email in the thread, a layout lease does not
make the layout "read only". It just means the lease owner will be
notified when someone else is about to modify it. The lease owner
can modify the mapping themselves, and they will not get notified
about their own modifications.

> They then pin pages which locks parts of
> the layout and therefore they expect no "writers" to change the layout.

Except they can change the layout themselves. It's perfectly valid
to get a layout lease, write() from offset 0 to EOF and fsync() to
initialise the file and allocate all the space in the file, then
mmap() it and hand it off to RDMA, all while holding the layout
lease.

> The "write" layout lease breaks the "read" layout lease indicating that the
> layout is being written to.

Layout leases do not work this way.

> In fact, this is what NFS does right now.  The lease it puts on the file is of
> "read" type.
> 
> nfs4layouts.c:
> static int
> nfsd4_layout_setlease(struct nfs4_layout_stateid *l

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dave Chinner
On Thu, Jun 13, 2019 at 07:31:07PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 14, 2019 at 12:09:21PM +1000, Dave Chinner wrote:
> > If the lease holder modifies the mapping in a way that causes it's
> > own internal state to screw up, then that's a bug in the lease
> > holder application.
> 
> Sounds like the lease semantics aren't the right ones for the longterm
> GUP users then.  The point of the longterm GUP is so the pages can be
> written to, and if the filesystem is going to move the pages around when
> they're written to, that just won't work.

And now we go full circle back to the constraints we decided on long
ago because we can't rely on demand paging RDMA hardware any time
soon to do everything we need to transparently support long-term GUP
on file-backed mappings. i.e.:

RDMA to file backed mappings must first preallocate and
write zeros to the range of the file they are mapping so
that the filesystem block mapping is complete and static for
the life of the RDMA mapping that will pin it.

IOWs, the layout lease will tell the RDMA application that the
static setup it has already done  to work correctly with a file
backed mapping may be about to be broken by a third party.
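
i.e. the setup sequence that constraint implies looks something like
this (a sketch; the layout lease step uses the proposed F_LAYOUT
constant, the rest is standard libibverbs, and fd/len/pd are assumed
to be set up already):

        /* preallocate and zero the whole range so the block mapping
         * is complete and static before any pages are pinned */
        char *zeros = calloc(1, len);
        if (pwrite(fd, zeros, len, 0) != (ssize_t)len)
                err(1, "pwrite");
        fsync(fd);
        free(zeros);

        /* take the (proposed) layout lease, then map and register */
        fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT);

        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);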

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dave Chinner
On Thu, Jun 13, 2019 at 01:34:05PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > Are you suggesting that we have something like this from user space?
> > > > > 
> > > > >   fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > 
> > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > policy it entails is "exclusive"?
> > > > 
> > > > i.e. what we are talking about here is an exclusive lease that
> > > > prevents other processes from changing the layout. i.e. the
> > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > actually presenting to users is "exclusive access"...
> > > 
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".
> > 
> > 
> > Layout leases are not locks, they are a user access policy object.
> > It is the process/fd which holds the lease and it's the process/fd
> > that is granted exclusive access.  This is exactly the same semantic
> > as O_EXCL provides for granting exclusive access to a block device
> > via open(), yes?
> > 
> > > As I understand it, this is rather more like a 'shared' or
> > > 'read' lock.  The filesystem would be the one which wants an exclusive
> > > lock, so it can modify the mapping of logical to physical blocks.
> > 
> > ISTM that you're conflating internal filesystem implementation with
> > application visible semantics. Yes, the filesystem uses internal
> > locks to serialise the modification of the things the lease manages
> > access to, but that has nothing to do with the access policy the
> > lease provides to users.
> > 
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> > 
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> > 
> > i.e. the exclusive layout lease prevents other processes from
> > performing operations that may need to modify the layout from
> > performing those operations. It does not "lock" the file/inode in
> > any way, it just changes how the layout lease breaking behaves.
> 
> Question: Do we expect Process A to get notified that Process B was attempting
> to change the layout?

In which case?

In the non-exclusive case, yes, the lease gets
recalled and the application needs to play nice and release its
references and drop the lease.

In the exclusive case, no. The application has said "I don't play
nice with others" and so we basically tell process B to get stuffed
and process A can continue onwards oblivious to the wreckage it
leaves behind

> This changes the exclusivity semantics.  While Process A has an exclusive 
> lease
> it could release it if notified to allow process B temporary exclusivity.

And then it's not an exclusive lease - it's just a normal layout
lease. Process B -does not need a lease- to write to the file.

All the layout lease does is provide notification to applications
that rely on the layout of the file being under their control that
someone else is about to modify the layout. The lease holder that
"plays nice" then releases the layout and drops it's lease, allowing
process B to begin it's operation.

Process A then immediately takes a new layout lease, and remaps the
file layout via FIEMAP or by creating a new RDMA MR for the mmap
region. Those operations get serialised by the filesystem because
the operation being run by process B is run atomically w.r.t. the
original lease being broken. Hence the new mapping t

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dave Chinner
On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote:
> > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > > e.g. Process A has an exclusive layout lease on file F. It does an
> > > > IO to file F. The filesystem IO path checks that Process A owns the
> > > > lease on the file and so skips straight through layout breaking
> > > > because it owns the lease and is allowed to modify the layout. It
> > > > then takes the inode metadata locks to allocate new space and write
> > > > new data.
> > > > 
> > > > Process B now tries to write to file F. The FS checks whether
> > > > Process B owns a layout lease on file F. It doesn't, so then it
> > > > tries to break the layout lease so the IO can proceed. The layout
> > > > breaking code sees that process A has an exclusive layout lease
> > > > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > > > break the lease and so the IO fails with -ETXTBSY.
> > > 
> > > This description doesn't match the behaviour that RDMA wants either.
> > > Even if Process A has a lease on the file, an IO from Process A which
> > > results in blocks being freed from the file is going to result in the
> > > RDMA device being able to write to blocks which are now freed (and
> > > potentially reallocated to another file).
> > 
> > I don't understand why this would not work for RDMA?  As long as the layout
> > does not change the page pins can remain in place.
> 
> Because process A had a layout lease (and presumably a MR) and the
> layout was still modified in way that invalidates the RDMA MR.

The lease holder is allowed to modify the mapping it has a lease
over. That's necessary so lease holders can write data into
unallocated space in the file. The lease is there to prevent third
parties from modifying the layout without the lease holder being
informed and taking appropriate action to allow that 3rd party
modification to occur.

If the lease holder modifies the mapping in a way that causes it's
own internal state to screw up, then that's a bug in the lease
holder application.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Dave Chinner
On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > Are you suggesting that we have something like this from user space?
> > > > 
> > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > 
> > > Rather than "unbreakable", perhaps a clearer description of the
> > > policy it entails is "exclusive"?
> > > 
> > > i.e. what we are talking about here is an exclusive lease that
> > > prevents other processes from changing the layout. i.e. the
> > > mechanism used to guarantee a lease is exclusive is that the layout
> > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > actually presenting to users is "exclusive access"...
> > 
> > That's rather different from the normal meaning of 'exclusive' in the
> > context of locks, which is "only one user can have access to this at
> > a time".  As I understand it, this is rather more like a 'shared' or
> > 'read' lock.  The filesystem would be the one which wants an exclusive
> > lock, so it can modify the mapping of logical to physical blocks.
> > 
> > The complication being that by default the filesystem has an exclusive
> > lock on the mapping, and what we're trying to add is the ability for
> > readers to ask the filesystem to give up its exclusive lock.
> 
> This is an interesting view...
> 
> And after some more thought, exclusive does not seem like a good name for this
> because technically F_WRLCK _is_ an exclusive lease...
> 
> In addition, the user does not need to take the "exclusive" write lease to be
> notified of (broken by) an unexpected truncate.  A "read" lease is broken by
> truncate.  (And "write" leases really don't do anything different WRT the
> interaction of the FS and the user app.  Write leases control "exclusive"
> access between other file descriptors.)

I've been assuming that there is only one type of layout lease -
there is no use case I've heard of for read/write layout leases, and
like you say there is zero difference in behaviour at the filesystem
level - they all have to be broken to allow a non-lease truncate to
proceed.

IMO, taking a "read lease" to be able to modify and write to the
underlying mapping of a file makes absolutely no sense at all.
IOWs, we're talking exactly about a revocable layout lease vs an
exclusive layout lease here, and so read/write really doesn't match
the policy or semantics we are trying to provide.

> Another thing to consider is that this patch set _allows_ a truncate/hole 
> punch
> to proceed _if_ the pages being affected are not actually pinned.  So the
> unbreakable/exclusive nature of the lease is not absolute.

If you're talking about the process that owns the layout lease
running the truncate, then that is fine.

However, if you are talking about a process that does not own the
layout lease being allowed to truncate a file without first breaking
the layout lease, then that is fundamentally broken.

i.e. If you don't own a layout lease, the layout leases must be
broken before the truncate can proceed. If it's an exclusive lease,
then you cannot break the lease and the truncate *must fail before
it is started*. i.e.  the layout lease state must be correctly
resolved before we start an operation that may modify a file layout.

Determining if we can actually do the truncate based on page state
occurs /after/ the lease says the truncate can proceed

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Dave Chinner
On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > Are you suggesting that we have something like this from user space?
> > > 
> > >   fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > 
> > Rather than "unbreakable", perhaps a clearer description of the
> > policy it entails is "exclusive"?
> > 
> > i.e. what we are talking about here is an exclusive lease that
> > prevents other processes from changing the layout. i.e. the
> > mechanism used to guarantee a lease is exclusive is that the layout
> > becomes "unbreakable" at the filesystem level, but the policy we are
> > actually presenting to users is "exclusive access"...
> 
> That's rather different from the normal meaning of 'exclusive' in the
> context of locks, which is "only one user can have access to this at
> a time".


Layout leases are not locks, they are a user access policy object.
It is the process/fd which holds the lease and it's the process/fd
that is granted exclusive access.  This is exactly the same semantic
as O_EXCL provides for granting exclusive access to a block device
via open(), yes?

> As I understand it, this is rather more like a 'shared' or
> 'read' lock.  The filesystem would be the one which wants an exclusive
> lock, so it can modify the mapping of logical to physical blocks.

ISTM that you're conflating internal filesystem implementation with
application visible semantics. Yes, the filesystem uses internal
locks to serialise the modification of the things the lease manages
access to, but that has nothing to do with the access policy the
lease provides to users.

e.g. Process A has an exclusive layout lease on file F. It does an
IO to file F. The filesystem IO path checks that Process A owns the
lease on the file and so skips straight through layout breaking
because it owns the lease and is allowed to modify the layout. It
then takes the inode metadata locks to allocate new space and write
new data.

Process B now tries to write to file F. The FS checks whether
Process B owns a layout lease on file F. It doesn't, so then it
tries to break the layout lease so the IO can proceed. The layout
breaking code sees that process A has an exclusive layout lease
granted, and so returns -ETXTBSY to process B - it is not allowed to
break the lease and so the IO fails with -ETXTBSY.

i.e. the exclusive layout lease prevents other processes from
performing operations that may need to modify the layout from
performing those operations. It does not "lock" the file/inode in
any way, it just changes how the layout lease breaking behaves.
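
A sketch of what that looks like from the filesystem side
(illustrative pseudo-kernel code - the helper names here are made up,
this is not the existing break_layout() implementation):

        /* called on the write/truncate path before the layout may change */
        static int break_layout_for(struct inode *inode, struct file *filp)
        {
                struct file_lock *lease = find_layout_lease(inode); /* hypothetical */

                if (!lease)
                        return 0;        /* no lease, nothing to break */
                if (lease_owned_by(lease, filp))
                        return 0;        /* owner may modify its own layout */
                if (lease_is_exclusive(lease))
                        return -ETXTBSY; /* non-owner: fail, don't recall */

                /* normal lease: notify the holder and wait for it to
                 * release the lease before the operation goes ahead */
                return recall_lease_and_wait(lease);
        }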

Further, the "exclusiveness" of a layout lease is completely
irrelevant to the filesystem that is indicating that an operation
that may need to modify the layout is about to be performed. All the
filesystem has to do is handle failures to break the lease
appropriately.  Yes, XFS serialises the layout lease validation
against other IO to the same file via it's IO locks, but that's an
internal data IO coherency requirement, not anything to do with
layout lease management.

Note that I talk about /writes/ here. This is interchangeable with
any other operation that may need to modify the extent layout of the
file, be it truncate, fallocate, etc: the attempt to break the
layout lease by a non-owner should fail if the lease is "exclusive"
to the owner.

> The complication being that by default the filesystem has an exclusive
> lock on the mapping, and what we're trying to add is the ability for
> readers to ask the filesystem to give up its exclusive lock.

The filesystem doesn't even lock the "mapping" until after the
layout lease has been validated or broken.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-07 Thread Dave Chinner
On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote:
> > On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > > > On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> > > > > From: Ira Weiny 
> > > > 
> > > > So I'd like to actually mandate that you *must* hold the file lease 
> > > > until
> > > > you unpin all pages in the given range (not just that you have an 
> > > > option to
> > > > hold a lease). And I believe the kernel should actually enforce this. 
> > > > That
> > > > way we maintain a sane state that if someone uses a physical location of
> > > > logical file offset on disk, he has a layout lease. Also once this is 
> > > > done,
> > > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > > and kill it if he wishes so.
> > > 
> > > Fair enough.
> > > 
> > > I was kind of heading that direction but had not thought this far 
> > > forward.  I
> > > was exploring how to have a lease remain on the file even after a "lease
> > > break".  But that is incompatible with the current semantics of a "layout"
> > > lease (as currently defined in the kernel).  [In the end I wanted to get 
> > > an RFC
> > > out to see what people think of this idea so I did not look at keeping the
> > > lease.]
> > > 
> > > Also hitch is that currently a lease is forcefully broken after
> > > /lease-break-time.  To do what you suggest I think we would need a 
> > > new
> > > lease type with the semantics you describe.
> > 
> > I'd do what Dave suggested - add flag to mark lease as unbreakable by
> > truncate and teach file locking core to handle that. There actually is
> > support for locks that are not broken after given timeout so there
> > shouldn't be too many changes need.
> >  
> > > Previously I had thought this would be a good idea (for other reasons).  
> > > But
> > > what does everyone think about using a "longterm lease" similar to [1] 
> > > which
> > > has the semantics you proppose?  In [1] I was not sure "longterm" was a 
> > > good
> > > name but with your proposal I think it makes more sense.
> > 
> > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
> > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
> > truncate.
> 
> Ok I want to make sure I understand what you and Dave are suggesting.
> 
> Are you suggesting that we have something like this from user space?
> 
>   fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);

Rather than "unbreakable", perhaps a clearer description of the
policy it entails is "exclusive"?

i.e. what we are talking about here is an exclusive lease that
prevents other processes from changing the layout. i.e. the
mechanism used to guarantee a lease is exclusive is that the layout
becomes "unbreakable" at the filesystem level, but the policy we are
actually presenting to users is "exclusive access"...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-05-28 Thread Dave Chinner
On Tue, May 28, 2019 at 09:07:19PM -0700, Darrick J. Wong wrote:
> On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote:
> > On 5/29/19 10:47 AM, Dave Chinner wrote:
> > > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
> > > > On 5/28/19 5:17 PM, Jan Kara wrote:
> > > > > I'm sorry but I don't follow what you suggest. One COW operation is a 
> > > > > call
> > > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several 
> > > > > times,
> > > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> > > > > ->iomap_end() once. So I don't see a difference between doing 
> > > > > something in
> > > > > ->actor() and ->iomap_end() (besides the passed arguments but that 
> > > > > does not
> > > > > seem to be your concern). So what do you exactly want to do?
> > > > 
> > > > Hi Jan,
> > > > 
> > > > Thanks for pointing out, and I'm sorry for my mistake.  It's
> > > > ->dax_iomap_rw(), not ->dax_iomap_actor().
> > > > 
> > > > I want to call the callback function at the end of ->dax_iomap_rw().
> > > > 
> > > > Like this:
> > > > dax_iomap_rw(..., callback) {
> > > > 
> > > >  ...
> > > >  while (...) {
> > > >  iomap_apply(...);
> > > >  }
> > > > 
> > > >  if (callback != null) {
> > > >  callback();
> > > >  }
> > > >  return ...;
> > > > }
> > > 
> > > Why does this need to be in dax_iomap_rw()?
> > > 
> > > We already do post-dax_iomap_rw() "io-end callbacks" directly in
> > > xfs_file_dax_write() to update the file size
> > 
> > Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation.
> > And an is-cow flag (from iomap) is also needed to determine if we call it.  I
> > think it would be better to put this into ->dax_iomap_rw() as a callback
> > function.
> 
> Sort of like how iomap_dio_rw takes a write endio function?

You mean like we originally had in the DAX code for unwritten
extents?

But we got rid of that because performance of unwritten extents was
absolutely woeful - it's cheaper in terms of CPU cost to do up front
zeroing (i.e. inside ->iomap_begin) than it is to use unwritten
extents and convert them to protect against stale data exposure.

I have a feeling that exactly the same thing is true for CoW - the
hoops we jump through to do COW fork manipulation and then extent
movement between the COW fork and the data fork on IO completion
would be better done before we commit the COW extent allocation.

In which case, what we actually want for DAX is:


iomap_apply()

	->iomap_begin()
		map old data extent that we copy from

		allocate new data extent we copy to in data fork,
		immediately replacing old data extent

		return transaction handle as private data

	dax_iomap_actor()
		copies data from old extent to new extent

	->iomap_end
		commits transaction now data has been copied, making
		the COW operation atomic with the data copy.


This, in fact, should be how we do all DAX writes that require
allocation, because then we get rid of the need to zero newly
allocated or unwritten extents before we copy the data into it. i.e.
we only need to write once to newly allocated storage rather than
twice.

This gets rid of the need for COW callbacks, and means the DAX
reflink implementation does not need to use delalloc
speculative preallocation or COW forks at all.
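
To put that flow in rough code form - a sketch only, not the real XFS
code: xfs_dax_cow_alloc() and the function names are made up for
illustration, and the prototypes are the pre-srcmap iomap_ops of this
era - the transaction handle rides in iomap->private:

static int
xfs_dax_cow_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			unsigned flags, struct iomap *iomap)
{
	struct xfs_trans	*tp;
	int			error;

	/*
	 * Placeholder: start a transaction, map the old (source) extent,
	 * and allocate the new extent directly in the data fork so it
	 * immediately replaces the old mapping.
	 */
	error = xfs_dax_cow_alloc(inode, pos, length, &tp, iomap);
	if (error)
		return error;

	iomap->private = tp;		/* committed in ->iomap_end */
	return 0;
}

static int
xfs_dax_cow_iomap_end(struct inode *inode, loff_t pos, loff_t length,
		      ssize_t written, unsigned flags, struct iomap *iomap)
{
	struct xfs_trans	*tp = iomap->private;

	if (written <= 0) {
		xfs_trans_cancel(tp);
		return 0;
	}
	/* Data has been copied by dax_iomap_actor(); commit atomically. */
	return xfs_trans_commit(tp);
}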

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-05-28 Thread Dave Chinner
On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
> 
> On 5/28/19 5:17 PM, Jan Kara wrote:
> > On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
> > > On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I'm working on reflink & dax in XFS, here are some thoughts on this:
> > > > > 
> > > > > As mentioned above: the second iomap's offset and length must match 
> > > > > the
> > > > > first.  I thought so at the beginning, but later found that the only
> > > > > difference between these two iomaps is @addr.  So, what about adding a
> > > > > @saddr, which means the source address of COW extent, into the struct 
> > > > > iomap.
> > > > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not.  
> > > > > Then
> > > > > handle this @saddr in each ->actor().  No more modifications in other
> > > > > functions.
> > > > 
> > > > Yes, I started off with the exact idea before being recommended this by 
> > > > Dave.
> > > > I used two fields instead of one namely cow_pos and cow_addr which 
> > > > defined
> > > > the source details. I had put it as a iomap flag as opposed to a type
> > > > which of course did not appeal well.
> > > > 
> > > > We may want to use iomaps for cases where two inodes are involved.
> > > > An example of the other scenario where offset may be different is file
> > > > comparison for dedup: vfs_dedup_file_range_compare(). However, it would
> > > > need two inodes in iomap as well.
> > > > 
> > > Yes, it is reasonable.  Thanks for your explanation.
> > > 
> > > One more thing RFC:
> > > I'd like to add an end-io callback argument in ->dax_iomap_actor() to 
> > > update
> > > the metadata after one whole COW operation is completed.  The end-io can
> > > also be called in ->iomap_end().  But one COW operation may call
> > > ->iomap_apply() many times, and so does the end-io.  Thus, I think it 
> > > would
> > > be nice to move it to the bottom of ->dax_iomap_actor(), called just once 
> > > in
> > > each COW operation.
> > 
> > I'm sorry but I don't follow what you suggest. One COW operation is a call
> > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
> > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> > ->iomap_end() once. So I don't see a difference between doing something in
> > ->actor() and ->iomap_end() (besides the passed arguments but that does not
> > seem to be your concern). So what do you exactly want to do?
> 
> Hi Jan,
> 
> Thanks for pointing out, and I'm sorry for my mistake.  It's
> ->dax_iomap_rw(), not ->dax_iomap_actor().
> 
> I want to call the callback function at the end of ->dax_iomap_rw().
> 
> Like this:
> dax_iomap_rw(..., callback) {
> 
>         ...
>         while (...) {
>                 iomap_apply(...);
>         }
> 
>         if (callback != null) {
>                 callback();
>         }
>         return ...;
> }

Why does this need to be in dax_iomap_rw()?

We already do post-dax_iomap_rw() "io-end callbacks" directly in
xfs_file_dax_write() to update the file size

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-05-22 Thread Dave Chinner
On Wed, May 22, 2019 at 03:14:47PM -0500, Goldwyn Rodrigues wrote:
> On  9:51 21/05, Darrick J. Wong wrote:
> > On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote:
> > We cannot use @inline_data to convey the source address.  @inline_data
> > (so far) is used to point to the in-memory representation of the storage
> > described by @addr.  For data writes, @addr is the location of the write
> > on disk and @inline_data is the location of the write in memory.
> > 
> > Reusing @inline_data here to point to the location of the source data in
> > memory is a totally different thing and will likely result in confusion.
> > On a practical level, this also means that we cannot support the case of
> > COW && INLINE because the type codes collide and so would the users of
> > @inline_data.  This isn't required *right now*, but if you had a pmem
> > filesystem that stages inode updates in memory and flips a pointer to
> > commit changes then the ->iomap_begin function will need to convey two
> > pointers at once.
> > 
> > So this brings us back to Dave's suggestion during the V1 patchset
> > review that instead of adding more iomap flags/types and overloading
> > fields, we simply pass two struct iomaps into ->iomap_begin:
> 
> Actually, Dave is the one who suggested to perform it this way.
> https://patchwork.kernel.org/comment/22562195/

My first suggestion was to use two iomaps. This suggestion came
later, as a way of demonstrating that a different type could be used
to redefine what ->inline_data was used for, if people considered
that an acceptable solution.

What was apparent from other discussions in the thread you quote was
that using two iomaps looked to be the better, more general approach
to solving the iomap read-modify-write issue at hand.
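
In ops terms, the two-iomap approach amounts to something like this
(sketch only; essentially the shape that later landed upstream as the
srcmap argument to ->iomap_begin()):

struct iomap_ops {
	/*
	 * @iomap describes where the write goes; @srcmap, when it differs,
	 * describes the existing data to copy from for a COW write.
	 */
	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
			   unsigned flags, struct iomap *iomap,
			   struct iomap *srcmap);
	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
			 ssize_t written, unsigned flags, struct iomap *iomap);
};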

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v6 6/6] xfs: disable map_sync for async flush

2019-04-23 Thread Dave Chinner
On Tue, Apr 23, 2019 at 01:36:12PM +0530, Pankaj Gupta wrote:
> Don't support 'MAP_SYNC' with non-DAX files and DAX files
> with asynchronous dax_device. Virtio pmem provides
> asynchronous host page cache flush mechanism. We don't
> support 'MAP_SYNC' with virtio pmem and xfs.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  fs/xfs/xfs_file.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f2e2845eb76..0e59be018511 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1196,11 +1196,13 @@ xfs_file_mmap(
>  	struct file	*filp,
>  	struct vm_area_struct	*vma)
>  {
> -	/*
> -	 * We don't support synchronous mappings for non-DAX files. At least
> -	 * until someone comes with a sensible use case.
> +	struct dax_device	*dax_dev = xfs_find_daxdev_for_inode
> +					(file_inode(filp));
> +
> +	/* We don't support synchronous mappings for non-DAX files and
> +	 * for DAX files if underneath dax_device is not synchronous.
>  	 */

/*
 * This is the correct multi-line comment format. Please
 * update the patch to maintain the existing comment format.
 */

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

2019-04-03 Thread Dave Chinner
On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> Virtio pmem provides asynchronous host page cache flush
> mechanism. we don't support 'MAP_SYNC' with virtio pmem 
> and xfs.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  fs/xfs/xfs_file.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f2e2845eb76..dced2eb8c91a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1203,6 +1203,14 @@ xfs_file_mmap(
>  	if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
>  		return -EOPNOTSUPP;
>  
> +	/* We don't support synchronous mappings with DAX files if
> +	 * dax_device is not synchronous.
> +	 */
> +	if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> +		xfs_find_daxdev_for_inode(file_inode(filp))) &&
> +	    (vma->vm_flags & VM_SYNC))
> +		return -EOPNOTSUPP;
> +
>  	file_accessed(filp);
>  	vma->vm_ops = &xfs_file_vm_ops;
>  	if (IS_DAX(file_inode(filp)))

All this ad hoc IS_DAX conditional logic is getting pretty nasty.

xfs_file_mmap(
	struct file		*filp,
	struct vm_area_struct	*vma)
{
	struct inode		*inode = file_inode(filp);

	if (vma->vm_flags & VM_SYNC) {
		if (!IS_DAX(inode))
			return -EOPNOTSUPP;
		if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
			return -EOPNOTSUPP;
	}

	file_accessed(filp);
	vma->vm_ops = &xfs_file_vm_ops;
	if (IS_DAX(inode))
		vma->vm_flags |= VM_HUGEPAGE;
	return 0;
}


Even better, factor out all the "MAP_SYNC supported" checks into a
helper so that the filesystem code just doesn't have to care about
the details of checking for DAX+MAP_SYNC support
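
i.e. something like the following (a sketch of the sort of helper being
asked for - roughly what later appeared in include/linux/dax.h as
daxdev_mapping_supported(); the name and placement here are illustrative):

static inline bool dax_mmap_supported(struct vm_area_struct *vma,
				      struct dax_device *dax_dev)
{
	if (!(vma->vm_flags & VM_SYNC))
		return true;			/* nothing special requested */
	if (!IS_DAX(file_inode(vma->vm_file)))
		return false;			/* MAP_SYNC needs DAX */
	return dax_synchronous(dax_dev);	/* and a synchronous device */
}

Then every filesystem ->mmap method collapses to a single
"if (!dax_mmap_supported(vma, dax_dev)) return -EOPNOTSUPP;" check.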

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Hang / zombie process from Xarray page-fault conversion (bisected)

2019-03-14 Thread Dave Chinner
On Thu, Mar 14, 2019 at 12:34:51AM -0700, Dan Williams wrote:
> On Mon, Mar 11, 2019 at 9:38 PM Dave Chinner  wrote:
> >
> > On Mon, Mar 11, 2019 at 08:35:05PM -0700, Dan Williams wrote:
> > > On Mon, Mar 11, 2019 at 8:10 AM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Thu, Mar 07, 2019 at 10:16:17PM -0800, Dan Williams wrote:
> > > > > Hi Willy,
> > > > >
> > > > > We're seeing a case where RocksDB hangs and becomes defunct when
> > > > > trying to kill the process. v4.19 succeeds and v4.20 fails. Robert was
> > > > > able to bisect this to commit b15cd800682f "dax: Convert page fault
> > > > > handlers to XArray".
> > > > >
> > > > > I see some direct usage of xa_index and wonder if there are some more
> > > > > pmd fixups to do?
> > > > >
> > > > > Other thoughts?
> > > >
> > > > I don't see why killing a process would have much to do with PMD
> > > > misalignment.  The symptoms (hanging on a signal) smell much more like
> > > > leaving a locked entry in the tree.  Is this easy to reproduce?  Can you
> > > > get /proc/$pid/stack for a hung task?
> > >
> > > It's fairly easy to reproduce, I'll see if I can package up all the
> > > dependencies into something that fails in a VM.
> > >
> > > It's limited to xfs, no failure on ext4 to date.
> > >
> > > The hung process appears to be:
> > >
> > >  kworker/53:1-xfs-sync/pmem0
> >
> > That's completely internal to XFS. Every 30s the work is triggered
> > and it either does a log flush (if the fs is active) or it syncs the
> > superblock to clean the log and idle the filesystem. It has nothing
> > to do with user processes, and I don't see why killing a process has
> > any effect on what it does...
> >
> > > ...and then the rest of the database processes grind to a halt from there.
> > >
> > > Robert was kind enough to capture /proc/$pid/stack, but nothing 
> > > interesting:
> > >
> > > [<0>] worker_thread+0xb2/0x380
> > > [<0>] kthread+0x112/0x130
> > > [<0>] ret_from_fork+0x1f/0x40
> > > [<0>] 0x
> >
> > Much more useful would be:
> >
> > # echo w > /proc/sysrq-trigger
> >
> > And post the entire output of dmesg.
> 
> Here it is:
> 
> https://gist.github.com/djbw/ca7117023305f325aca6f8ef30e11556

Which tells us nothing. :(

I think a bisect is in order...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Hang / zombie process from Xarray page-fault conversion (bisected)

2019-03-11 Thread Dave Chinner
On Mon, Mar 11, 2019 at 08:35:05PM -0700, Dan Williams wrote:
> On Mon, Mar 11, 2019 at 8:10 AM Matthew Wilcox  wrote:
> >
> > On Thu, Mar 07, 2019 at 10:16:17PM -0800, Dan Williams wrote:
> > > Hi Willy,
> > >
> > > We're seeing a case where RocksDB hangs and becomes defunct when
> > > trying to kill the process. v4.19 succeeds and v4.20 fails. Robert was
> > > able to bisect this to commit b15cd800682f "dax: Convert page fault
> > > handlers to XArray".
> > >
> > > I see some direct usage of xa_index and wonder if there are some more
> > > pmd fixups to do?
> > >
> > > Other thoughts?
> >
> > I don't see why killing a process would have much to do with PMD
> > misalignment.  The symptoms (hanging on a signal) smell much more like
> > leaving a locked entry in the tree.  Is this easy to reproduce?  Can you
> > get /proc/$pid/stack for a hung task?
> 
> It's fairly easy to reproduce, I'll see if I can package up all the
> dependencies into something that fails in a VM.
> 
> It's limited to xfs, no failure on ext4 to date.
> 
> The hung process appears to be:
> 
>  kworker/53:1-xfs-sync/pmem0

That's completely internal to XFS. Every 30s the work is triggered
and it either does a log flush (if the fs is active) or it syncs the
superblock to clean the log and idle the filesystem. It has nothing
to do with user processes, and I don't see why killing a process has
any effect on what it does...

> ...and then the rest of the database processes grind to a halt from there.
> 
> Robert was kind enough to capture /proc/$pid/stack, but nothing interesting:
> 
> [<0>] worker_thread+0xb2/0x380
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x1f/0x40
> [<0>] 0x

Much more useful would be:

# echo w > /proc/sysrq-trigger

And post the entire output of dmesg.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax

2019-02-22 Thread Dave Chinner
On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> >  wrote:
> > >
> > > Hi all!
> > >
> > > Uh, we have an internal customer  who's been trying out MAP_SYNC
> > > on pmem, and they've observed that one has to do a fair amount of
> > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > > so the PMD mappings are much more efficient.

Are you really saying that "mkfs.xfs -d su=2MB,sw=1 " is
considered "too much legwork" to set up the filesystem for DAX and
PMD alignment?


> > > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > > set up all the parameters automatically.  Below is my ham-handed attempt
> > > to teach the kernel to do this.

Still need extent size hints so that writes that are smaller than
the PMD size are allocated correctly aligned and sized to map to
PMDs...
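
For reference, the per-file hint can be set from userspace with the
FSSETXATTR ioctl - the same thing "xfs_io -c 'extsize 2m' <file>" does.
A sketch with error handling trimmed (directories would also set
FS_XFLAG_EXTSZINHERIT so new files inherit the hint):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Set a 2MiB extent size hint on an open fd; the value is in bytes
 * and must be a multiple of the filesystem block size. */
static int set_extsize_2m(int fd)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = 2 * 1024 * 1024;
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}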

> > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > >
> > > --D
> > >
> > > ---
> > > Configure pmem devices to advertise the default page alignment when said
> > > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > > hints to try to create aligned file extents, which makes it much easier
> > > for mmaps to take advantage of huge page table entries.
> > >
> > > Signed-off-by: Darrick J. Wong 
> > > ---
> > >  drivers/nvdimm/pmem.c |    5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > index bc2f700feef8..3eeb9dd117d5 100644
> > > --- a/drivers/nvdimm/pmem.c
> > > +++ b/drivers/nvdimm/pmem.c
> > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > > blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > > blk_queue_max_hw_sectors(q, UINT_MAX);
> > > blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > -   if (pmem->pfn_flags & PFN_MAP)
> > > +   if (pmem->pfn_flags & PFN_MAP) {
> > > blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > +   blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > +   blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > 
> > The device alignment might sometimes be bigger than this default.
> > Would there be any detrimental effects for filesystems if io_min and
> > io_opt were set to 1GB?
> 
> Hmmm, that's going to be a struggle on ext4 and the xfs data device
> because we'd be preferentially skipping the 1023.8MB immediately after
> each allocation group's metadata.  It already does this now with a 2MB
> io hint, but losing 1.8MB here and there isn't so bad.
> 
> We'd have to study it further, though; filesystems historically have
> interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
> think very many people set up 1GB raid stripe units.

Setting sunit=1GB is really going to cause havoc with things like
inode chunk allocation alignment, and the first write() will either
have to be >=1GB or use 1GB extent size hints to trigger alignment.
And, AFAICT, it will prevent us from doing 2MB alignment on other
files, even with 2MB extent size hints set.

IOWs, I don't think 1GB alignment is a good idea as a default.

> (I doubt very many people have done 2M raid stripes either, but it seems
> to work easily where we've tried it...)

That's been pretty common with stacked hardware raid for as long as
I've worked on XFS. e.g. a software RAID0 stripe of hardware RAID5/6
luns was pretty common with large storage arrays in HPC environments
(i.e. huge streaming read/write bandwidth). In these cases, XFS was
set up with the RAID5/6 lun width as the stripe unit (commonly 2MB
with 8+1 and 256k raid chunk size), and the RAID 0
width as the stripe width (commonly 8-16 wide spread across 8-16 FC
ports w/ multipath) and it wasn't uncommon to see widths in the
16-32MB range.

This aligned the filesystem to the underlying RAID5/6 luns, and
allows stripe width IO to be aligned and hit every RAID5/6 lun
evenly. Ensuring applications could do this easily with large direct
IO reads and writes is where the swalloc and largeio mount
options come into their own

> > I'm thinking and xfs-realtime configuration might be able to support
> > 1GB mappings in the future.
> 
> The x

Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax

2019-02-22 Thread Dave Chinner
On Fri, Feb 22, 2019 at 10:20:08AM -0800, Darrick J. Wong wrote:
> Hi all!
> 
> Uh, we have an internal customer  who's been trying out MAP_SYNC
> on pmem, and they've observed that one has to do a fair amount of
> legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> so the PMD mappings are much more efficient.
> 
> I started poking around w.r.t. what mkfs.xfs was doing and realized that
> if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> set up all the parameters automatically.  Below is my ham-handed attempt
> to teach the kernel to do this.

What's the before and after mkfs output?

(need to see the context that this "fixes" before I comment)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM

2019-02-18 Thread Dave Chinner
On Mon, Feb 18, 2019 at 06:15:34PM -0800, Jane Chu wrote:
> On 2/15/2019 9:39 PM, Dave Chinner wrote:
> 
> >On Sat, Feb 16, 2019 at 04:31:33PM +1100, Dave Chinner wrote:
> >>On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> >>>(This is a joint proposal with Hannes Reinecke)
> >>>
> >>>Servers with NV-DIMM are slowly emerging in data centers but one key 
> >>>feature
> >>>for reliability of these systems hasn't been addressed up to now, data
> >>>redundancy.
> >>>
> >>>While it would be best to solve this issue in the memory controller of the 
> >>>CPU
> >>>itself, I don't see this coming in the next few years. This puts us as the 
> >>>OS
> >>>in the burden to create the redundant copies of data for the users.
> >>>
> >>>If we leave of the DAX support Linux' software RAID implementations (MD,
> >>>device-mapper and BTRFS RAID) do already work on top of pmem devices, but 
> >>>they
> >>>are incompatible with DAX.
> >>>
> >>>In this session Hannes and I would like to discuss eventual ways how we as 
> >>>an
> >>>operating system can mitigate these issues for our users.
> >>We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
> >>allow per-device dax status checking for filesystems"). That is,
> >>we can have DAX on the XFS RT device independently of the data device.
> >>
> >>That is, you set up pmem in three segments - two small identical
> >>segments that get mirrored with RAID1 as the data device, and
> >>the remainder as a block device that is dax capable set up as the
> >>XFS realtime device. Set the RTINHERIT bit on the root directory at
> >>mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
> >>capable realtime device, and all the metadata goes to the software
> >>raided pmem block devices that aren't DAX capable.
> >>
> >>Problem already solved, yes?
> >Sorry, this was meant to be a reply to Dan's email commenting about
> >some people needing mirrored metadata, not the parent that was
> >talking about whole device RAID...
> >
> >i.e. mirrored metadata w/ FS-DAX for data should already be a solved
> >problem...
> 
> What about DAX Ext4 filesystem? What could be done for the EXT4
> filesystem metadata?

AFAIK, this can't be supported with ext4 right now. You'll have to
either use hardware DIMM mirroring or wait to see if full device
mirroring can be done in software and still preserve DAX (which is
what the original proposal is about).

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM

2019-02-15 Thread Dave Chinner
On Sat, Feb 16, 2019 at 04:31:33PM +1100, Dave Chinner wrote:
> On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> > (This is a joint proposal with Hannes Reinecke)
> > 
> > Servers with NV-DIMM are slowly emerging in data centers but one key feature
> > for reliability of these systems hasn't been addressed up to now, data
> > redundancy.
> > 
> > While it would be best to solve this issue in the memory controller of the 
> > CPU
> > itself, I don't see this coming in the next few years. This puts us as the 
> > OS
> > in the burden to create the redundant copies of data for the users.
> > 
> > If we leave of the DAX support Linux' software RAID implementations (MD,
> > device-mapper and BTRFS RAID) do already work on top of pmem devices, but 
> > they
> > are incompatible with DAX.
> > 
> > In this session Hannes and I would like to discuss eventual ways how we as 
> > an
> > operating system can mitigate these issues for our users.
> 
> We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
> allow per-device dax status checking for filesystems"). That is,
> we can have DAX on the XFS RT device independently of the data device.
> 
> That is, you set up pmem in three segments - two small identical
> segments that get mirrored with RAID1 as the data device, and
> the remainder as a block device that is dax capable set up as the
> XFS realtime device. Set the RTINHERIT bit on the root directory at
> mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
> capable realtime device, and all the metadata goes to the software
> raided pmem block devices that aren't DAX capable.
> 
> Problem already solved, yes?

Sorry, this was meant to be a reply to Dan's email commenting about
some people needing mirrored metadata, not the parent that was
talking about whole device RAID...

i.e. mirrored metadata w/ FS-DAX for data should already be a solved
problem...

Cheers,

Dave.
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 

-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] The end of the DAX experiment

2019-02-15 Thread Dave Chinner
On Thu, Feb 14, 2019 at 12:20:11PM -0800, Dan Williams wrote:
> On Thu, Feb 14, 2019 at 12:09 PM Matthew Wilcox  wrote:
> >
> > On Thu, Feb 14, 2019 at 11:31:24AM -0800, Dan Williams wrote:
> > > On Thu, Feb 14, 2019 at 11:10 AM Jerome Glisse  wrote:
> > > > I am just again working on my struct page mapping patchset as well as
> > > > the generic page write protection that sits on top. I hope to be able
> > > > to post the v2 in couple weeks. You can always look at my posting last
> > > > year to see more details.
> > >
> > > Yes, I have that in mind as one of the contenders. However, it's not
> > > clear to me that its a suitable fit for filesystem-reflink. Others
> > > have floated the 'page proxy' idea, so it would be good to discuss the
> > > merits of the general approaches.
> >
> > ... and my preferred option of putting pfn entries in the page cache.
> 
> Another option to include the discussion.
> 
> > Or is that what you meant by "page proxy"?
> 
> Page proxy would be an object that a filesystem could allocate to
> point back to a single physical 'struct page *'. The proxy would
> contain an override for page->index.

Needs to override page->mapping, potentially the page flags, what
happens when someone takes a gup reference to the proxy structure,
etc?

Besides, didn't we discuss all this last year? how about we start
with a summary of all the options considered last year, the pros and
cons, etc, and then go from there? Regurgitating everything we've
already talked about last year doesn't seem particularly useful to
me...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LSF/MM TOPIC] Software RAID Support for NV-DIMM

2019-02-15 Thread Dave Chinner
On Fri, Feb 15, 2019 at 10:57:12AM +0100, Johannes Thumshirn wrote:
> (This is a joint proposal with Hannes Reinecke)
> 
> Servers with NV-DIMM are slowly emerging in data centers but one key feature
> for reliability of these systems hasn't been addressed up to now, data
> redundancy.
> 
> While it would be best to solve this issue in the memory controller of the CPU
> itself, I don't see this coming in the next few years. This puts us as the OS
> in the burden to create the redundant copies of data for the users.
> 
> If we leave of the DAX support Linux' software RAID implementations (MD,
> device-mapper and BTRFS RAID) do already work on top of pmem devices, but they
> are incompatible with DAX.
> 
> In this session Hannes and I would like to discuss eventual ways how we as an
> operating system can mitigate these issues for our users.

We've supported this since mid 2018 and commit ba23cba9b3bd ("fs:
allow per-device dax status checking for filesystems"). That is,
we can have DAX on the XFS RT device independently of the data device.

That is, you set up pmem in three segments - two small identical
segments that get mirrored with RAID1 as the data device, and
the remainder as a block device that is dax capable set up as the
XFS realtime device. Set the RTINHERIT bit on the root directory at
mkfs time ("-d rtinherit=1") and then all the data goes to the DAX
capable realtime device, and all the metadata goes to the software
raided pmem block devices that aren't DAX capable.

Problem already solved, yes?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Qemu-devel] security implications of caching with virtio pmem (was Re: [PATCH v3 0/5] kvm "virtio pmem" device)

2019-02-11 Thread Dave Chinner
On Mon, Feb 11, 2019 at 02:29:46AM -0500, Pankaj Gupta wrote:
> Hello Dave,
> Are we okay with this?

Sure.

I'm not sure I agree with all the analysis presented, but, well, I
haven't looked any deeper because I'm tired of being shouted at and
being called argumentative for daring to ask hard questions about
this topic

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-15 Thread Dave Chinner
On Tue, Jan 15, 2019 at 12:35:06AM -0500, Pankaj Gupta wrote:
> 
> > > > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> > > > >
> > > > > > > Until you have images (and hence host page cache) shared between
> > > > > > > multiple guests. People will want to do this, because it means 
> > > > > > > they
> > > > > > > only need a single set of pages in host memory for executable
> > > > > > > binaries rather than a set of pages per guest. Then you have
> > > > > > > multiple guests being able to detect residency of the same set of
> > > > > > > pages. If the guests can then, in any way, control eviction of the
> > > > > > > pages from the host cache, then we have a guest-to-guest
> > > > > > > information
> > > > > > > leak channel.
> > > > > >
> > > > > > I don't think we should ever be considering something that would
> > > > > > allow a
> > > > > > guest to evict page's from the host's pagecache [1].  The guest
> > > > > > should
> > > > > > be able to kick its own references to the host's pagecache out of 
> > > > > > its
> > > > > > own pagecache, but not be able to influence whether the host or
> > > > > > another
> > > > > > guest has a read-only mapping cached.
> > > > > >
> > > > > > [1] Unless the guest is allowed to modify the host's file; obviously
> > > > > > truncation, holepunching, etc are going to evict pages from the
> > > > > > host's
> > > > > > page cache.
> > > > >
> > > > > This is so correct. Guest does not evict host page cache pages
> > > > > directly.
> > > >
> > > > They don't right now.
> > > >
> > > > But someone is going to end up asking for discard to work so that
> > > > the guest can free unused space in the underlying sparse image (i.e.
> > > > make use of fstrim or mount -o discard) because they have workloads
> > > > that have bursts of space usage and they need to trim the image
> > > > files afterwards to keep their overall space usage under control.
> > > >
> > > > And then
> > > 
> > > ...we reject / push back on that patch citing the above concern.
> > 
> > So at what point do we draw the line?
> > 
> > We're allowing writable DAX mappings, but as I've pointed out that
> > means we are going to be allowing  a potential information leak via
> > files with shared extents to be directly mapped and written to.
> > 
> > But we won't allow useful admin operations that allow better
> > management of host side storage space similar to how normal image
> > files are used by guests because it's an information leak vector?
> 
> First of all Thank you for all the useful discussions. 
> I am summarizing here:
> 
> - We have to live with the limitation to not support fstrim and 
>   mount -o discard options with virtio-pmem as they will evict 
>   host page cache pages. We cannot allow this for virtio-pmem
>   for security reasons. These filesystem commands will just zero out 
>   unused pages currently.

Not sure I follow you here - what pages are going to be zeroed and
when will they be zeroed? If discard is not allowed, filesystems
just don't issue such commands and the underlying device will never
see them.

> - If a lot of space is unused and not freed guest can request host 
>   Administrator for truncating the host backing image. 

You can't use truncate to free space in a disk image file. The only
way to do it safely in a generic, filesystem agnostic way is to
mount the disk image (e.g. on loopback) and run fstrim on it. The
loopback device will punch holes in the file where all the free
space is reported by the filesystem via discard requests.

Which is kinda my point - this could only be done if the guest is
shut down, which makes it very difficult for admins to manage. 

>   We are also planning to support qcow2 sparse image format at 
>   host side with virtio-pmem.

So you're going to be remapping a huge number of disjoint regions
into a linear pmem mapping? ISTR discussions about similar things
for virtio+fuse+dax that came up against "large numbers of mapped
regions don't scale" and so it wasn't a practical solution compared
to just using raw sparse files

> - There is no existing solution for Qemu persistent memory 
>   emulation with write support currently. This solution provides 
>   us the paravartualized way of emulating persistent memory.

Sure, but the question is why do you need to create an emulation
that doesn't actually perform like pmem? The whole point of pmem is
performance, and emulating pmem by mmap() of a file on spinning
disks is going to be horrible for performance. Even on SSDs it's
going to be orders of magnitudes slower than real pmem.

So exactly what problem are you trying to solve with this driver?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-14 Thread Dave Chinner
On Mon, Jan 14, 2019 at 01:35:57PM -0800, Dan Williams wrote:
> On Mon, Jan 14, 2019 at 1:25 PM Dave Chinner  wrote:
> >
> > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> > >
> > > > > Until you have images (and hence host page cache) shared between
> > > > > multiple guests. People will want to do this, because it means they
> > > > > only need a single set of pages in host memory for executable
> > > > > binaries rather than a set of pages per guest. Then you have
> > > > > multiple guests being able to detect residency of the same set of
> > > > > pages. If the guests can then, in any way, control eviction of the
> > > > > pages from the host cache, then we have a guest-to-guest information
> > > > > leak channel.
> > > >
> > > > I don't think we should ever be considering something that would allow a
> > > > guest to evict page's from the host's pagecache [1].  The guest should
> > > > be able to kick its own references to the host's pagecache out of its
> > > > own pagecache, but not be able to influence whether the host or another
> > > > guest has a read-only mapping cached.
> > > >
> > > > [1] Unless the guest is allowed to modify the host's file; obviously
> > > > truncation, holepunching, etc are going to evict pages from the host's
> > > > page cache.
> > >
> > > This is so correct. Guest does not evict host page cache pages 
> > > directly.
> >
> > They don't right now.
> >
> > But someone is going to end up asking for discard to work so that
> > the guest can free unused space in the underlying sparse image (i.e.
> > make use of fstrim or mount -o discard) because they have workloads
> > that have bursts of space usage and they need to trim the image
> > files afterwards to keep their overall space usage under control.
> >
> > And then
> 
> ...we reject / push back on that patch citing the above concern.

So at what point do we draw the line?

We're allowing writable DAX mappings, but as I've pointed out that
means we are going to be allowing  a potential information leak via
files with shared extents to be directly mapped and written to.

But we won't allow useful admin operations that allow better
management of host side storage space similar to how normal image
files are used by guests because it's an information leak vector?

That's splitting some really fine hairs there...

> > > In case of virtio-pmem & DAX, guest clears guest page cache exceptional 
> > > entries.
> > > Its solely decision of host to take action on the host page cache pages.
> > >
> > > In case of virtio-pmem, guest does not modify host file directly i.e don't
> > > perform hole punch & truncation operation directly on host file.
> >
> > ... this will no longer be true, and the nuclear landmine in this
> > driver interface will have been armed
> 
> I agree with the need to be careful when / if explicit cache control
> is added, but that's not the case today.

"if"?

I expect it to be "when", not if. Expect the worst, plan for it now.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-14 Thread Dave Chinner
On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> 
> > > Until you have images (and hence host page cache) shared between
> > > multiple guests. People will want to do this, because it means they
> > > only need a single set of pages in host memory for executable
> > > binaries rather than a set of pages per guest. Then you have
> > > multiple guests being able to detect residency of the same set of
> > > pages. If the guests can then, in any way, control eviction of the
> > > pages from the host cache, then we have a guest-to-guest information
> > > leak channel.
> > 
> > I don't think we should ever be considering something that would allow a
> > guest to evict page's from the host's pagecache [1].  The guest should
> > be able to kick its own references to the host's pagecache out of its
> > own pagecache, but not be able to influence whether the host or another
> > guest has a read-only mapping cached.
> > 
> > [1] Unless the guest is allowed to modify the host's file; obviously
> > truncation, holepunching, etc are going to evict pages from the host's
> > page cache.
> 
> This is so correct. Guest does not evict host page cache pages directly. 

They don't right now.

But someone is going to end up asking for discard to work so that
the guest can free unused space in the underlying sparse image (i.e.
make use of fstrim or mount -o discard) because they have workloads
that have bursts of space usage and they need to trim the image
files afterwards to keep their overall space usage under control.

And then

> In case of virtio-pmem & DAX, guest clears guest page cache exceptional 
> entries.
> Its solely decision of host to take action on the host page cache pages.
> 
> In case of virtio-pmem, guest does not modify host file directly i.e don't
> perform hole punch & truncation operation directly on host file.  

... this will no longer be true, and the nuclear landmine in this
driver interface will have been armed

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-13 Thread Dave Chinner
On Sun, Jan 13, 2019 at 03:38:21PM -0800, Matthew Wilcox wrote:
> On Mon, Jan 14, 2019 at 10:29:02AM +1100, Dave Chinner wrote:
> > Until you have images (and hence host page cache) shared between
> > multiple guests. People will want to do this, because it means they
> > only need a single set of pages in host memory for executable
> > binaries rather than a set of pages per guest. Then you have
> > multiple guests being able to detect residency of the same set of
> > pages. If the guests can then, in any way, control eviction of the
> > pages from the host cache, then we have a guest-to-guest information
> > leak channel.
> 
> I don't think we should ever be considering something that would allow a
> guest to evict page's from the host's pagecache [1].  The guest should
> be able to kick its own references to the host's pagecache out of its
> own pagecache, but not be able to influence whether the host or another
> guest has a read-only mapping cached.
> 
> [1] Unless the guest is allowed to modify the host's file; obviously
> truncation, holepunching, etc are going to evict pages from the host's
> page cache.

Right, and that's exactly what I mean by "we need to be real careful
with functionality like this".

To be honest, I really don't think I've even touched the surface
here.

e.g. Filesystems and storage can share logical and physical extents.
Which means that image files that share storage (e.g.  because they
are all cloned from the same master image and/or there's in-line
deduplication running on the storage) and can be directly accessed
by guests may very well be susceptible to detection of host side
deduplication and subsequent copy-on-write operations.

This really doesn't seem much different to me from the guest being
able to infer host side KSM page deduplication and COW operation in
the guest side page cache.  The only difference is that DAX is being
used to probe the host side page cache and storage rather than the
guest side.

IOWs, I suspect there's a world of pain waiting for us if we punch
huge holes through the virtual machine abstractions like this.

Improving performance is a laudable goal, but at what price?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-13 Thread Dave Chinner
On Fri, Jan 11, 2019 at 02:45:04AM -0500, Pankaj Gupta wrote:
> 
> > 
> > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote:
> > >  This patch series has implementation for "virtio pmem".
> > >  "virtio pmem" is fake persistent memory(nvdimm) in guest
> > >  which allows to bypass the guest page cache. This also
> > >  implements a VIRTIO based asynchronous flush mechanism.
> > 
> > H. Sharing the host page cache direct into the guest VM. Sounds
> > like a good idea, but.
> > 
> > This means the guest VM can now run timing attacks to observe host
> > side page cache residency, and depending on the implementation I'm
> > guessing that the guest will be able to control host side page
> > cache eviction, too (e.g. via discard or hole punch operations).
> 
> Not sure how? this is similar to mmapping virtual memory by any userspace 
> process. Any host userspace process can do such an attack on host page cache
> using mincore & mmap shared file. 

Mincore is for monitoring, not cached eviction. And it's not
required to observe cache residency, either. That's a wide open
field containing an uncountable number of moles...

> But I don't think guest can do this alone. For virtio-pmem usecase
> guest won't be using page cache so timing attack from only guest
> side is not possible unless host userspace can run checks on page
> cache eviction state using mincore etc.  As rightly described by
> Rik, guest will only access its own page cache pages and if guest
> page cache is managed directly by host, this saves a lot of effort
> for guest in transferring guest state of page cache.  

Until you have images (and hence host page cache) shared between
multiple guests. People will want to do this, because it means they
only need a single set of pages in host memory for executable
binaries rather than a set of pages per guest. Then you have
multiple guests being able to detect residency of the same set of
pages. If the guests can then, in any way, control eviction of the
pages from the host cache, then we have a guest-to-guest information
leak channel.

i.e. it's something we need to be aware of and really careful about
enabling infrastructure that /will/ be abused if guests can find a
way to influence the host side cache residency.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-09 Thread Dave Chinner
On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote:
>  This patch series has implementation for "virtio pmem". 
>  "virtio pmem" is fake persistent memory(nvdimm) in guest 
>  which allows to bypass the guest page cache. This also
>  implements a VIRTIO based asynchronous flush mechanism.  

H. Sharing the host page cache direct into the guest VM. Sounds
like a good idea, but.

This means the guest VM can now run timing attacks to observe host
side page cache residency, and depending on the implementation I'm
guessing that the guest will be able to control host side page
cache eviction, too (e.g. via discard or hole punch operations).

Which means this functionality looks to me like a new vector for
information leakage into and out of the guest VM via guest
controlled host page cache manipulation.

https://arxiv.org/pdf/1901.01161

I might be wrong, but if I'm not we're going to have to be very
careful about how guest VMs can access and manipulate host side
resources like the page cache.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Problems with VM_MIXEDMAP removal from /proc//smaps

2018-11-01 Thread Dave Chinner
On Wed, Oct 31, 2018 at 05:59:17AM +, y-g...@fujitsu.com wrote:
> > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner  wrote:
> > In summary:
> > 
> > MAP_DIRECT is an access hint.
> > 
> > MAP_SYNC provides a data integrity model guarantee.
> > 
> > MAP_SYNC may imply MAP_DIRECT for specific implementations,
> > but it does not require or guarantee MAP_DIRECT.
> > 
> > Let's compare that with O_DIRECT:
> > 
> > O_DIRECT in an access hint.
> > 
> > O_DSYNC provides a data integrity model guarantee.
> > 
> > O_DSYNC may imply O_DIRECT for specific implementations, but
> > it does not require or guarantee O_DIRECT.
> > 
> > Consistency in access and data integrity models is a good thing. DAX
> > and pmem is not an exception. We need to use a model we know works
> > and has proven itself over a long period of time.
> 
> Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT.
> (I'm not opposed to your opinion, but I need to know it.)
> 
> In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is 
> "wrong alignment is specified by application", right?

O_DIRECT has defined memory and offset alignment restrictions, and
will return an error to userspace when they are violated. It does
not fall back to buffered IO in this case. MAP_DIRECT has no
equivalent restriction, so IO alignment of O_DIRECT is largely
irrelevant here.

What we are talking about here is that some filesystems can only do
certain operations through buffered IO, such as block allocation or
file extension, and so silently fall back to doing them via buffered
IO even when O_DIRECT is specified. The old direct IO code used to
be full of conditionals to allow this - I think DIO_SKIP_HOLES is
the only one remaining:

/*
 * For writes that could fill holes inside i_size on a
 * DIO_SKIP_HOLES filesystem we forbid block creations: only
 * overwrites are permitted. We will return early to the caller
 * once we see an unmapped buffer head returned, and the caller
 * will fall back to buffered I/O.
 *
 * Otherwise the decision is left to the get_blocks method,
 * which may decide to handle it or also return an unmapped
 * buffer head.
 */
	create = dio->op == REQ_OP_WRITE;
	if (dio->flags & DIO_SKIP_HOLES) {
		if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
					i_blkbits))
			create = 0;
	}

Other cases like file extension cases are caught by the filesystems
before calling into the DIO code itself, so there's multiple avenues
for O_DIRECT transparently falling back to buffered IO.

This means the applications don't fail just because the filesystem
can't do a specific operation via O_DIRECT. The data writes still
succeed because they fall back to buffered IO, and the application
is blissfully unaware that the filesystem behaved that way.

> When filesystem can not use O_DIRECT and it uses page cache instead,
> then system uses more memory resource than user's expectation.

That's far better than failing unexpectedly because the app
unexpectedly came across a hole in the file (e.g. someone ran
sparsify across the filesystem).

> So, there is a side effect, and it may cause other trouble.
> (memory pressure, expected performance can not be gained, and so on ..)

Which is why people are supposed to test their systems before they
put them into production.

I've lost count of the number of times I've heard "but O_DIRECT is
supposed to make things faster!" because people don't understand
exactly what it does or means. Bypassing the page cache does not
magically make applications go faster - it puts the responsibility
for doing optimal IO on the application, not the kernel.

MAP_DIRECT will be no different. It's no guarantee that it will make
things faster, or that everything will just work as users expect
them to. It specifically places the responsibility for performing IO
in an optimal fashion on the application and the user for making
sure that it is fit for their purposes. Like O_DIRECT, using
MAP_DIRECT means "I, the application, know exactly what I'm doing,
so get out of the way as much as possible because I'm taking
responsibility for issuing IO in the most optimal manner now".

> In such case its administrator (or technical support engineer) needs to 
> struggle to
> investigate what is the reason.

That's no different to performance problems that arise from
inappro

Re: Problems with VM_MIXEDMAP removal from /proc//smaps

2018-10-18 Thread Dave Chinner
On Thu, Oct 18, 2018 at 12:10:13PM -0700, Dan Williams wrote:
> The only caveat to address all the use cases for applications making
> decisions based on the presence of DAX

And that's how we've got into this mess.

Applications need to focus on the functionality they require, not
the technology that provides it. That's the root of the we are
trying to solve here and really I don't care if we have to break
existing applications to do it. i.e. we've made no promises about
API/ABI stability and the functionality is still experimental.

Fundamentally, DAX is a technology, not an API property. The two
"DAX" API properties that matter to applications are:

1. does mmap allow us to use CPU flush instructions for data
integrity operations safely? And
2. can mmap directly access the backing store without
incurring any additional overhead?

MAP_SYNC provides #1, MAP_DIRECT provides #2, and DAX provides both.
However, they do not define DAX, nor does DAX define them. e.g.

MAP_SYNC can be provided by a persistent memory page cache.
But a persistent memory page cache does not provide
MAP_DIRECT.

MAP_SYNC can be provided by filesystem DAX, but *only* when
direct access is used. i.e. MAP_SYNC | MAP_DIRECT

MAP_DIRECT can be provided by filesystem DAX, but it does
not imply or require MAP_SYNC behaviour.

IOWs, using MAP_SYNC and/or MAP_DIRECT to answer an "is DAX
present" question ties the API to a technology rather than to the
functionality the technology provides applications.

i.e. If the requested behaviour/property is not available from the
underlying technology, then the app needs to handle that error and
use a different access method.

> applications making
> decisions based on the presence of DAX
> is to make MADV_DIRECT_ACCESS
> fail if the mapping was not established with MAP_SYNC.

And so this is wrong - MADV_DIRECT_ACCESS does not require MAP_SYNC.

It is perfectly legal for MADV_DIRECT_ACCESS to be used without
MAP_SYNC - the app just needs to use msync/fsync instead.

Wanting to enable full userspace CPU data sync semantics via
madvise() implies we also need MADV_SYNC in addition to
MADV_DIRECT_ACCESS.

i.e. Apps that are currently testing for dax should use
mmap(MAP_SYNC|MAP_DIRECT) or madvise(MADV_SYNC|MADV_DIRECT) and they
will fail if the underlying storage is not DAX capable. The app
doesn't need to poke at anything else to see if DAX is enabled - if
the functionality is there then it will work, otherwise they need to
handle the error and do something else.
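
e.g. a userspace sketch of that model, using just the MAP_SYNC half
since MAP_DIRECT doesn't exist upstream: ask for the behaviour, and fall
back to msync/fsync for data integrity if the storage can't provide it.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <errno.h>

#ifndef MAP_SYNC
#define MAP_SHARED_VALIDATE	0x03	/* uapi values, for older libc headers */
#define MAP_SYNC		0x080000
#endif

/* Try a MAP_SYNC mapping; fall back to a plain shared mapping where
 * the caller must use msync()/fsync() for data integrity. */
static void *map_for_update(int fd, size_t len, int *needs_msync)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*needs_msync = 0;	/* CPU cache flushes suffice */
		return p;
	}
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return MAP_FAILED;

	*needs_msync = 1;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}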

> That way we
> have both a way to assert that page cache resources are not being
> consumed, and that the kernel is handling metadata synchronization for
> any write-faults.

Yes, we need to do that, but not at the cost of having the API
prevent apps from ever being able to use direct access + msync/fsync
data integrity operations.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Problems with VM_MIXEDMAP removal from /proc//smaps

2018-10-18 Thread Dave Chinner
On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > MAP_SYNC
> > > - file system guarantees that metadata required to reach faulted-in file
> > >   data is consistent on media before a write fault is completed.  A
> > >   side-effect is that the page cache will not be used for
> > >   writably-mapped pages.
> > 
> > I think you are conflating current implementation with API
> > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > use. The man page definition simply says "supported only for files
> > supporting DAX" and that it provides certain data integrity
> > guarantees. It does not define the implementation.
> > 
> > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > because that's the only way we can currently provide the required
> > behaviour to userspace. However, if a filesystem can use the page
> > cache to provide the required functionality, then it's free to do
> > so.
> > 
> > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > integrity could be provided /without DAX/ by any filesystem using
> > that persistent page cache. i.e. MAP_SYNC really only requires
> > mmap() of CPU addressable persistent memory - it does not require
> > DAX. Right now, however, the only way to get this functionality is
> > through a DAX capable filesystem on dax capable storage.
> > 
> > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > COWs new pages in pmem and attaches them a special per-inode cache
> > on clean->dirty transition. Then on data sync, background writeback
> > or crash recovery, it migrates them from the cache into the file map
> > proper via atomic metadata pointer swaps.
> > 
> > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > separate persistent per-inode write cache to provide the correct
> > crash recovery semantics for MAP_SYNC.
> 
> Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> effectively it will also be able to provide MAP_DIRECT semantics, right?

Yes, I think so. It still needs to do COW on first write fault,
but then the app has direct access to the data buffer until it is
cleaned and put back in place. The "put back in place" is just an
atomic swap of metadata pointers, so it doesn't need the page cache
at all...

> Because there won't be DRAM between app and persistent storage and I don't
> think COW tricks or other data integrity methods are that interesting for
> the application.

Not for the application, but the filesystem still wants to support
snapshots and other such functionality that requires COW. And NOVA
doesn't have write-in-place functionality at all - it always COWs
on the clean->dirty transition.

> Most users of O_DIRECT are concerned about getting close
> to media speed performance and low DRAM usage...

*nod*

> > > and what I think Dan had proposed:
> > > 
> > > mmap flag, MAP_DIRECT
> > > - file system guarantees that page cache will not be used to front 
> > > storage.
> > >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> > >   The subtle difference is that a write fault /may/ not result in metadata
> > >   being written back to media.
> > 
> > Similar to O_DIRECT, these semantics do not allow userspace apps to
> > replace msync/fsync with CPU cache flush operations. So any
> > application that uses this mode still needs to use either MAP_SYNC
> > or issue msync/fsync for data integrity.
> > 
> > If the app is using MAP_DIRECT, then what do we do if the filesystem
> > can't provide the required semantics for that specific operation? In
> > the case of O_DIRECT, we fall back to buffered IO because it has the
> > same data integrity semantics as O_DIRECT and will always work. It's
> > just slower and consumes more memory, but the app continues to work
> > just fine.
> > 
> > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > without using the pagecache seems extremely problematic to me.  e.g.
> > an app already has an open MAP_DIRECT file, and a third party
> > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > to do COW operations. This isn't the app's fault - the kernel should
> > just fall back transparently to using the page cache for the
> > MAP_DIRECT app and just keep working, just like it would if it was
> > using O_DIRECT read/write.
> 
> There's another option of faili

Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

2018-10-17 Thread Dave Chinner
> - file system guarantees that page cache will not be used to front storage.
>   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
>   The subtle difference is that a write fault /may/ not result in metadata
>   being written back to media.

Similar to O_DIRECT, these semantics do not allow userspace apps to
replace msync/fsync with CPU cache flush operations. So any
application that uses this mode still needs to use either MAP_SYNC
or issue msync/fsync for data integrity.

If the app is using MAP_DIRECT, then what do we do if the filesystem
can't provide the required semantics for that specific operation? In
the case of O_DIRECT, we fall back to buffered IO because it has the
same data integrity semantics as O_DIRECT and will always work. It's
just slower and consumes more memory, but the app continues to work
just fine.

Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
without using the pagecache seems extremely problematic to me.  e.g.
an app already has an open MAP_DIRECT file, and a third party
reflinks it or dedupes it and the fs has to fall back to buffered IO
to do COW operations. This isn't the app's fault - the kernel should
just fall back transparently to using the page cache for the
MAP_DIRECT app and just keep working, just like it would if it was
using O_DIRECT read/write.

The point I'm trying to make here is that O_DIRECT is a /hint/, not
a guarantee, and it's done that way to prevent applications from
being presented with transient, potentially fatal error cases
because a filesystem implementation can't do a specific operation
through the direct IO path.

IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
guarantee. Over time we'll end up with filesystems that can
guarantee that MAP_DIRECT is always going to use DAX, in the same
way we have filesystems that guarantee O_DIRECT will always be
O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
no page cache will ever be used, then we are basically saying
"filesystems won't provide MAP_DIRECT even in common, useful cases
because they can't provide MAP_DIRECT in all cases." And that
doesn't seem like a very good solution to me.
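
In other words, the hint semantics amount to something like the sketch
below - schematic only, with a made-up context structure and helpers, not
an existing kernel interface:

/*
 * Schematic only: "hint, not guarantee" fault handling. The fault_ctx
 * structure and all fs_/map_ helpers are invented for illustration.
 */
struct fault_ctx {
	struct inode	*inode;
	loff_t		offset;
	bool		want_direct;	/* app asked for MAP_DIRECT */
};

static int handle_write_fault(struct fault_ctx *ctx)
{
	if (ctx->want_direct && fs_can_map_direct(ctx->inode, ctx->offset))
		return map_fault_direct(ctx);	/* DAX-style direct mapping */

	/*
	 * The filesystem can't do direct access for this operation right
	 * now (e.g. the extent is shared after a reflink and COW has to go
	 * through the page cache). Fall back transparently rather than
	 * killing the app with SIGBUS - same data integrity semantics,
	 * just slower and using more memory.
	 */
	return map_fault_buffered(ctx);
}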

> and this is what I think you were proposing, Jan:
> 
> madvise flag, MADV_DIRECT_ACCESS
> - same semantics as MAP_DIRECT, but specified via the madvise system call

Seems to be the equivalent of fcntl(F_SETFL, O_DIRECT). Makes sense
to have both MAP_DIRECT and MADV_DIRECT_ACCESS to me - one is an
init time flag, the other is a run time flag.
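
A rough userspace sketch of the two variants - both flags are proposals
from this thread, not mainline UAPI, and the numeric values below are
placeholders invented for the example:

/* Illustrative only: MAP_DIRECT and MADV_DIRECT_ACCESS are proposed flags
 * from this discussion, not part of the mainline UAPI. The values below
 * are placeholders, not real flag allocations. */
#include <sys/mman.h>

#ifndef MAP_DIRECT
#define MAP_DIRECT		0x2000000	/* placeholder value */
#endif
#ifndef MADV_DIRECT_ACCESS
#define MADV_DIRECT_ACCESS	70		/* placeholder value */
#endif

void *map_init_time(int fd, size_t len)
{
	/* init-time hint, analogous to open(..., O_DIRECT) */
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_DIRECT, fd, 0);
}

int hint_run_time(void *addr, size_t len)
{
	/* run-time hint, analogous to fcntl(fd, F_SETFL, O_DIRECT) */
	return madvise(addr, len, MADV_DIRECT_ACCESS);
}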

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Snapshot target and DAX-capable devices

2018-09-04 Thread Dave Chinner
On Fri, Aug 31, 2018 at 11:42:55AM +0200, Jan Kara wrote:
> On Fri 31-08-18 09:38:09, Dave Chinner wrote:
> > On Thu, Aug 30, 2018 at 03:47:32PM -0400, Mikulas Patocka wrote:
> > > You can't support dax on snapshot - if someone maps a block and the block 
> > > needs to be moved, then what?
> > 
> > This is only a problem for access via mmap and page faults.

> > It's a whole different ballgame for a dm-snapshot device - block
> > devices are completely unaware of page faults to DAX file mappings.
> 
> Actually, block devices are not completely unaware of DAX page faults -
> they will get ->direct_access callback for the fault range. It does not
> currently convey enough information - we also need to inform the block
> device whether it is read or write. But that's about all that's needed to
> add AFAICT. And by comparing returned PFN with the one we have stored in
> the radix tree (which we have if that file offset is mapped by anybody),
> the filesystem / DAX code can tell whether remapping happened and do the
> unmapping.

I forgot about the direct access call.

But it seems like a hack to redefine the simple, fast sector-to-pfn
translation into a slow and potentially resource hungry interface
for physical storage reallocation.  Doing storage layer COW
operations inside direct_access takes us straight back to the bad
ways of get_block() interfaces. We moved all the filesystem
allocation to iomap so that the storage management is separated from
the mm/physical address translation side of DAX - doing block device
storage management operations inside ->direct_access effectively
reverts that separation and so just seems like a hack to me.
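
For reference, the check Jan describes above would amount to something
like this - a schematic sketch only, glossing over the exact
->direct_access signature (which has changed across kernel versions) and
hand-waving how the previous pfn was cached:

/*
 * Schematic sketch of the pfn-comparison idea quoted above; not real
 * fs/dax.c code.
 */
static bool dax_block_was_remapped(struct dax_device *dax_dev, pgoff_t pgoff,
				   unsigned long cached_pfn)
{
	void *kaddr;
	pfn_t pfn;

	/* Ask the device where this range lives *now*. Note the callback
	 * currently has no way to be told whether this is a read or a
	 * write fault. */
	if (dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn) < 0)
		return true;	/* treat errors as "unmap and retry" */

	/*
	 * If the device moved the block (e.g. snapshot COW), the pfn no
	 * longer matches what was previously mapped for this file offset,
	 * so every existing user mapping of the old pfn must be zapped.
	 */
	return pfn_t_to_pfn(pfn) != cached_pfn;
}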

Oh, right, DAX. Silly me. :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Snapshot target and DAX-capable devices

2018-08-30 Thread Dave Chinner
On Thu, Aug 30, 2018 at 03:47:32PM -0400, Mikulas Patocka wrote:
> 
> 
> On Thu, 30 Aug 2018, Jeff Moyer wrote:
> 
> > Mike Snitzer  writes:
> > 
> > > Until we properly add DAX support to dm-snapshot I'm afraid we really do
> > > need to tolerate this "regression".  Since reality is the original
> > > support for snapshot of a DAX DM device never worked in a robust way.
> > 
> > Agreed.
> > 
> > -Jeff
> 
> You can't support dax on snapshot - if someone maps a block and the block 
> needs to be moved, then what?

This is only a problem for access via mmap and page faults.

At the filesystem level, it's no different to the existing direct IO
algorithm for read/write IO - we simply allocate new space, copy the
data we need to copy into the new space (may be no copy needed), and
then write the new data into the new space. I'm pretty sure that for
bio-based IO to dm-snapshot devices the algorithm will be exactly
the same.

However, for direct access via mmap, we have to modify how the
userspace virtual address is mapped to the physical location. IOWs,
during the COW operation, we have to invalidate all existing user
mappings we have for that physical address. This means we have to do
an invalidation after the allocate/copy part of the COW operation.

If we are doing this during a page fault, it means we'll probably
have to restart the page fault so it can look up the new physical
address associated with the faulting user address. After we've done
the invalidation, any new (or restarted) page fault finds the
location of new copy we just made, maps it into the user address
space, updates the ptes and we're all good.

Well, that's the theory. We haven't implemented this for XFS yet, so
it might end up a little different, and we might yet hit unexpected
problems (it's DAX, that's what happens :/).
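
Condensed into pseudocode, the sequence above would be something like the
following - purely illustrative, with hypothetical fs_*() helpers, and not
the (as yet unwritten) XFS implementation:

/*
 * Illustrative pseudocode of the fault-path COW sequence described above.
 * The fs_*() helpers are hypothetical; unmap_mapping_range() is the real
 * kernel interface for zapping existing user mappings.
 */
static vm_fault_t dax_cow_write_fault(struct vm_fault *vmf, struct inode *inode)
{
	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
	u64 new_daddr;

	/* 1. Allocate new space and copy the existing data into it
	 *    (the copy may be a no-op for a full-block overwrite). */
	new_daddr = fs_cow_allocate_and_copy(inode, pos);

	/* 2. Swap the file's extent map over to the new location. */
	fs_cow_commit_remap(inode, pos, new_daddr);

	/* 3. Invalidate all existing user mappings of the old physical
	 *    address so nothing keeps writing through stale ptes. */
	unmap_mapping_range(inode->i_mapping, pos, PAGE_SIZE, 1);

	/* 4. Restart the fault; the retry looks up the new physical
	 *    address and installs ptes for the new copy. */
	return VM_FAULT_RETRY;
}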

It's a whole different ballgame for a dm-snapshot device - block
devices are completely unaware of page faults to DAX file mappings.
We'll need the filesystem to be aware it's on a remappable block
device, and when we take a DAX write fault we'll need to ask the
underlying device to remap the block and treat it like the
filesystem COW case above. We'll need to do this remap/invalidate
dance in the write IO path, too, because a COW by the block device
is no different to filesystem COW in that path.

Basically, it's the same algorithm as the filesystem COW case, we
just get the physical location of the data and the notification of
block changing physical location from a different interface.

Hmmm. ISTR that someone has been making a few noises recently about
virtual block address space mapping interfaces that could help solve
this problem...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


  1   2   3   >