Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Christoph Hellwig
On Thu, Nov 24, 2016 at 11:11:34AM -0700, Logan Gunthorpe wrote:
> * Regular DAX in the FS doesn't work at this time because the FS can
> move the file you think you're transferring out from under you. Though I
> understand there's been some work with XFS to solve that issue.

The file system will never move anything under locked down pages;
locking down pages is used exactly to protect against that.  So as long
as we have page structures available, RDMA to/from device memory _from
kernel space_ is trivial, although for file systems to work properly you
really want a notification to the consumer if the file system wants
to remove the mapping.  We have implemented that using FL_LAYOUT locks
for NFSD, but only XFS supports it so far.  Without that, a long term
locked down region of memory (e.g. a kernel MR) would block various
file operations, which would simply hang.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-24 Thread Al Viro
On Fri, Nov 25, 2016 at 06:06:42PM +1100, Dave Chinner wrote:

> > Tell that to Linus.  You had been in the room, IIRC, when that had been
> > brought up this year in Santa Fe.
> 
> No, I wasn't at KS or plumbers, so this is all news to me.

Sorry, thought you had been at KS ;-/  My apologies...

[snip bloody good points I fully agree with]

> I understand why there is a desire for stable tracepoints, and
> that's why I suggested that there should be an in-kernel API to
> declare stable tracepoints. That way we can have the best of both
> worlds - tracepoints that applications need to be stable can be
> declared, reviewed and explicitly marked as stable in full knowledge
> of what that implies. The rest of the vast body of tracepoints can
> be left as mutable with no stability or existence guarantees so that
> developers can continue to treat them in a way that best suits
> problem diagnosis without compromising the future development of the
> code being traced. If userspace finds some of those tracepoints
> useful, then they can be taken through the process of making them
> into a maintainable stable form and being marked as such.

My impression is that nobody (at least kernel-side) wants them to be
a stable ABI; as long as nobody in userland screams about their code
being broken, everything is fine.  As usual, if nobody notices an ABI
change, it hasn't happened.  The question is what happens when somebody
does.

> We already have distros mounting the tracing subsystem on
> /sys/kernel/tracing. Expose all the stable tracepoints there, and
> leave all the other tracepoints under /sys/kernel/debug/tracing.
> Simple, clear separation between stable and mutable diagnostic
> tracepoints for users, combined with a simple, clear in-kernel API
> and process for making tracepoints stable

Yep.  That kind of separation would be my preference as well - ideally,
with review for stable ones being a lot less casual than for unstable;
AFAICS what happens now is that we have no mechanism for marking them as
stable or unstable and everything keeps going on hope that nobody will
cause a mess by creating such a userland dependency.  So far it's been mostly
working, but as the set of tracepoints (and their use) gets wider and wider,
IMO it's only a matter of time until we get seriously screwed that way.

Basically, we are gambling on the next one to be cast in stone by userland
dependency being sane enough to make it possible to maintain it indefinitely
and I don't like the odds.


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-24 Thread Dave Chinner
On Fri, Nov 25, 2016 at 04:14:19AM +, Al Viro wrote:
> [Linus Cc'd]
> 
> On Fri, Nov 25, 2016 at 01:49:18PM +1100, Dave Chinner wrote:
> > > they have become parts of stable userland ABI and are to be maintained
> > > indefinitely.  Don't expect "tracepoints are special case" to prevent 
> > > that.
> > 
> > I call bullshit just like I always do when someone spouts this
> > "tracepoints are stable ABI" garbage.
> 
> > Quite frankly, anyone that wants to stop us from
> > adding/removing/changing tracepoints or the code that they are
> > reporting information about "because ABI" can go take a long walk
> > off a short cliff.  Diagnostic tracepoints are not part of the
> > stable ABI. End of story.
> 
> Tell that to Linus.  You had been in the room, IIRC, when that had been
> brought up this year in Santa Fe.

No, I wasn't at KS or plumbers, so this is all news to me. Believe
me, if I had been in the room when this discussion was in progress, you'd
remember it /very clearly/.

> "End of story" is not going to be
> yours (or mine, for that matter) to declare - Linus is the only one who
> can do that.  If he says "if userland code relies upon it, so that
> userland code needs to be fixed" - I'm very happy (and everyone involved
> can count upon quite a few free drinks from me at the next summit).  If
> it's "that userland code really shouldn't have relied upon it, and it's
> real unfortunate that it does, but we still get to keep it working" -
> too bad, "because ABI" is the reality and we will be the ones to take
> that long walk.

When the tracepoint infrastructure was added it was considered a
debugging tool and not stable - it was even exposed through
/sys/kernel/debug! We connected up the ~280 /debug/ tracepoints we
had in XFS at the time with the understanding it was a /diagnostic
tool/. We exposed all sorts of internal details we'd previously been
exposing with tracing through lcrash and kdb (and Irix before that)
so we could diagnose problems quickly on a running kernel.

The scope of tracepoints may have grown since then, but it does not
change the fact that many of the tracepoints that were added years
ago were done under the understanding that it was a mutable
interface and nobody could rely on any specific tracepoint detail
remaining unchanged.

We're still treating them as mutable diagnostic and debugging aids
across the kernel. In XFS, we've now got over *500* unique trace
events and *650* tracepoints; ignoring comments, *4%* of the entire
XFS kernel code base is tracing code.  We expose structure contents,
transaction states, locking algorithms, object life cycles, journal
operations, etc. All the new reverse mapping and shared data extent
code that has been merged in 4.8 and 4.9 has been extensively
exposed by tracepoints - these changes also modified a significant
number of existing tracepoints.

Put simply: every major piece and most minor pieces of functionality
in XFS are exposed via tracepoints.

Hence if the stable ABI tracepoint rules you've just described are
going to be enforced, it will mean we will not be able to change
anything significant in XFS, because almost everything significant we
do involves changing tracepoints in some way. This leaves us with
three unacceptable choices:

1. stop developing XFS so we can maintain the stable
tracepoint ABI;

2. ignore the ABI rules and hope that Linus keeps pulling
code that obviously ignores the ABI rules; or

3. screw over our upstream/vanilla kernel users by removing
the tracepoints from Linus' tree and suck up the pain of
maintaining an out of tree patch for XFS developers and
distros so kernel tracepoint ABI rules can be ignored.

Nobody wins if these are the only choices we are being given.

I understand why there is a desire for stable tracepoints, and
that's why I suggested that there should be an in-kernel API to
declare stable tracepoints. That way we can have the best of both
worlds - tracepoints that applications need to be stable can be
declared, reviewed and explicitly marked as stable in full knowledge
of what that implies. The rest of the vast body of tracepoints can
be left as mutable with no stability or existence guarantees so that
developers can continue to treat them in a way that best suits
problem diagnosis without compromising the future development of the
code being traced. If userspace finds some of those tracepoints
useful, then they can be taken through the process of making them
into a maintainable stable form and being marked as such.

We already have distros mounting the tracing subsystem on
/sys/kernel/tracing. Expose all the stable tracepoints there, and
leave all the other tracepoints under /sys/kernel/debug/tracing.
Simple, clear separation between stable and mutable diagnostic
tracepoints for users, combined with a simple, clear in-kernel API
and process for making tracepoints stable.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-24 Thread Al Viro
[Linus Cc'd]

On Fri, Nov 25, 2016 at 01:49:18PM +1100, Dave Chinner wrote:
> > they have become parts of stable userland ABI and are to be maintained
> > indefinitely.  Don't expect "tracepoints are special case" to prevent that.
> 
> I call bullshit just like I always do when someone spouts this
> "tracepoints are stable ABI" garbage.

> Quite frankly, anyone that wants to stop us from
> adding/removing/changing tracepoints or the code that they are
> reporting information about "because ABI" can go take a long walk
> off a short cliff.  Diagnostic tracepoints are not part of the
> stable ABI. End of story.

Tell that to Linus.  You had been in the room, IIRC, when that had been
brought up this year in Santa Fe.  "End of story" is not going to be
yours (or mine, for that matter) to declare - Linus is the only one who
can do that.  If he says "if userland code relies upon it, so that
userland code needs to be fixed" - I'm very happy (and everyone involved
can count upon quite a few free drinks from me at the next summit).  If
it's "that userland code really shouldn't have relied upon it, and it's
real unfortunate that it does, but we still get to keep it working" -
too bad, "because ABI" is the reality and we will be the ones to take
that long walk.

What I heard from Linus sounded a lot closer to the second variant.
_IF_ I have misinterpreted that, I'd love to hear that.  Linus?

PS: I'm dead serious about large amounts of booze of choice at LSFS 2017.
Bribery or shared celebration - call it whatever you like; I really,
really want to have tracepoints free from ABIfication concerns.  They
certainly are useful for debugging purposes - no arguments here.


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-24 Thread Dave Chinner
On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> Tracepoints are the standard way to capture debugging and tracing
> information in many parts of the kernel, including the XFS and ext4
> filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> be done in the same way as the filesystem tracing so that developers can
> look at them together and get a coherent idea of what the system is doing.
> 
> I added both an entry and exit tracepoint because future patches will add
> tracepoints to child functions of dax_iomap_pmd_fault() like
> dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> be wrapped by the parent function tracepoints so the code flow is more
> easily understood.  Having entry and exit tracepoints for faults also
> allows us to easily see what filesystem functions were called during the
> fault.  These filesystem functions get executed via iomap_begin() and
> iomap_end() calls, for example, and will have their own tracepoints.
> 
> For PMD faults we primarily want to understand the faulting address and
> whether it fell back to 4k faults.  If it fell back to 4k faults the
> tracepoints should let us understand why.
> 
> I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> to have its own separate tracing header in the same directory at some
> point.
> 
> Here is an example output for these events from a successful PMD fault:
> 
> big-2057  [000]    136.396855: dax_pmd_fault: shared mapping write
> address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> max_pgoff 0x1400
> 
> big-2057  [000]    136.397943: dax_pmd_fault_done: shared mapping write
> address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> max_pgoff 0x1400 NOPAGE

Can we make the output use the same format as most of the filesystem
code? i.e. the output starts with backing device + inode number like
so:

xfs_ilock:dev 8:96 ino 0x493 flags ILOCK_EXCL

This way we can filter the output easily across both dax and
filesystem tracepoints with 'grep "ino 0x493"'...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/6] dax: remove leading space from labels

2016-11-24 Thread Dan Williams
On Thu, Nov 24, 2016 at 1:11 AM, Jan Kara  wrote:
> On Wed 23-11-16 11:44:18, Ross Zwisler wrote:
>> No functional change.
>>
>> As of this commit:
>>
>> commit 218dd85887da (".gitattributes: set git diff driver for C source code
>> files")
>>
>> git-diff and git-format-patch both generate diffs whose hunks are correctly
>> prefixed by function names instead of labels, even if those labels aren't
>> indented with spaces.
>
> Fine by me. I just have some 4 remaining DAX patches (will send them out
> today) and they will clash with this. So I'd prefer if this happened after
> they are merged...

Let's just leave them alone, it's not like this thrash buys us
anything at this point.  We can just stop including spaces in new
code.


Re: [PATCH] x86: fix kaslr and memmap collision

2016-11-24 Thread Dan Williams
On Wed, Nov 23, 2016 at 4:04 PM, Dave Chinner  wrote:
> On Tue, Nov 22, 2016 at 11:01:32AM -0800, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 10:54 AM, Kees Cook  wrote:
>> > On Tue, Nov 22, 2016 at 9:26 AM, Dan Williams  
>> > wrote:
>> >> No, you're right, we need to handle multiple ranges.  Since the
>> >> mem_avoid array is statically allocated perhaps we can handle up to 4
>> >> memmap= entries, but past that point disable kaslr for that boot?
>> >
>> > Yeah, that seems fine to me. I assume it's rare to have 4?
>> >
>>
>> It should be rare to have *one* since ACPI 6.0 added support for
>> communicating persistent memory ranges.  However there are legacy
>> nvdimm users that I know are doing at least 2, but I have a hard time
>> imagining they would ever do more than 4.
>
> I doubt it's rare amongst the people using RAM to emulate pmem for
> filesystem testing purposes. My "pmem" test VM always has at least 2
> ranges set to give me two discrete pmem devices, and I have used 4
> from time to time to do things like test multi-volume scratch XFS
> filesystems in xfstests (i.e. data, log and realtime volumes) so I
> didn't need to play games with partitioning or DM...

Right, but for testing do you need kaslr to be active? You can have as
many memmap regions as you want, we'll just stop trying to find a
random kernel base address after you've defined 4.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Logan Gunthorpe


On 24/11/16 09:42 AM, Jason Gunthorpe wrote:
> There are three cases to worry about:
>  - Coherent long lived page table mirroring (RDMA ODP MR)
>  - Non-coherent long lived page table mirroring (RDMA MR)
>  - Short lived DMA mapping (everything else)
> 
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Yes, this makes sense to me. Though I thought regular IB MRs with
regular memory currently pin the pages (despite being long lived);
that's why we can run up against the "max locked memory" limit. It
doesn't seem so terrible if GPU memory had a similar restriction until
ODP-like solutions get implemented.

>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.

> Is there any progress on that?

Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work, but the problem I see with it is that it
only allows one application to do these transfers. Or there would have
to be some user-space coordination to figure out which application gets
what memory.

* Regular DAX in the FS doesn't work at this time because the FS can
move the file you think you're transferring out from under you. Though I
understand there's been some work with XFS to solve that issue.

Though, we've been considering that the backing memory would be
non-volatile, which adds some of this complexity. If the memory were
volatile, the kernel would just need to do some relatively
straightforward allocation to user-space when asked. For example, with
NVMe, the kernel could give chunks of the CMB buffer to userspace via an
mmap call to /dev/nvmeX. Though I think there's been some push back
against things like that as well.

> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...

The latest incarnation of iopmem simply created a block device backed by
ZONE_DEVICE memory on a PCIe BAR. We then put a DAX FS on it and
user-space could mmap the files and send them to other devices to do P2P
transfers.

I don't think there was a hard objection to uncachable ZONE_DEVICE and
DAX. We did try our experimental hardware with cached ZONE_DEVICE and it
did work but the performance was beyond unusable (which may be a
hardware issue). In the end I feel the driver would have to decide the
most appropriate caching for the hardware and I don't understand why WC
or UC wouldn't work with ZONE_DEVICE.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Logan Gunthorpe
Hey,

On 24/11/16 02:45 AM, Christian König wrote:
> E.g. it can happen that PCI device A exports it's BAR using ZONE_DEVICE.
> Now PCI device B (a SATA device) can directly read/write to it because
> it is on the same bus segment, but PCI device C (a network card for
> example) can't because it is on a different bus segment and the bridge
> can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see
it. We've tested with two separate PCI buses on different CPUs connected
through QPI links and it works fine. (It is rather slow but I understand
Intel has improved the bottleneck in newer CPUs than the ones we tested.)

It may just be older hardware that has this issue. As long as a failed
transfer can be handled gracefully by the initiator, I don't see a need
to predetermine whether a device can see another device's memory.


Logan


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-24 Thread Al Viro
On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> Tracepoints are the standard way to capture debugging and tracing
> information in many parts of the kernel, including the XFS and ext4
> filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> be done in the same way as the filesystem tracing so that developers can
> look at them together and get a coherent idea of what the system is doing.

It also has one hell of a potential for becoming a massive nuisance.
Keep in mind that if any userland code becomes dependent on those - that's it,
they have become parts of stable userland ABI and are to be maintained
indefinitely.  Don't expect "tracepoints are special case" to prevent that.

So treat anything you add in that manner as potential stable ABI
you might have to keep around forever.  It's *not* a glorified debugging
printk.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Serguei Sagalovitch


On 2016-11-24 11:26 AM, Jason Gunthorpe wrote:

> On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
> > Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
> > > There is certainly nothing about the hardware that cares
> > > about ZONE_DEVICE vs System memory.
> >
> > Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
> > PCI BAR and another PCI device initiates a DMA to this address the DMA
> > subsystem must be able to check if the interconnection really works.
>
> I said the hardware doesn't care.. You are right, we still have an
> outstanding problem in Linux of how to generically DMA map a P2P
> address - which is a different issue from getting the P2P address from
> a __user pointer...
>
> Jason

I agree, but the problem is that one issue immediately introduces
another to solve, and so on (if we do not want to cut corners). I would
think that a lot of them are interconnected, because the way one problem
is solved may impact the solution for another.

btw: about "DMA map a p2p address": Right now, to enable p2p between
devices it is required/recommended to disable iommu support (e.g. the
intel iommu driver has special logic for graphics and the comment
"Reserve all PCI MMIO to avoid peer-to-peer access").


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Jason Gunthorpe
On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
> >>> Only ODP hardware allows changing the DMA address on the fly, and it
> >>> works at the page table level. We do not need special handling for
> >>> RDMA.
> >>
> >> I am aware of ODP but, noted by others, it doesn't provide a general
> >> solution to the points above.
> > 
> > How do you mean?
> 
> I was only saying it wasn't general in that it wouldn't work for IB
> hardware that doesn't support ODP or other hardware  that doesn't do
> similar things (like an NVMe drive).

There are three cases to worry about:
 - Coherent long lived page table mirroring (RDMA ODP MR)
 - Non-coherent long lived page table mirroring (RDMA MR)
 - Short lived DMA mapping (everything else)

Like you say below we have to handle short lived in the usual way, and
that covers basically every device except IB MRs, including the
command queue on a NVMe drive.

> any complex allocators (GPU or otherwise) should respect that. And that
> seems like it should be the default way most of this works -- and I
> think it wouldn't actually take too much effort to make it all work now
> as is. (Our iopmem work is actually quite small and simple.)

Yes, absolutely, some kind of page pinning like locking is a hard
requirement.

> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
> memory working for some time. I'd say it's a good fit. The main question
> we've had is how to expose PCIe bars to userspace to be used as MRs and
> such.

Is there any progress on that?

I still don't quite get what iopmem was about.. I thought the
objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
over iopmem and still ending up with uncacheable mmaps still seems
like a non-starter to me...

Serguei, what is your plan in GPU land for migration? Ie if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
- do you still allow the CPU to access it? Or do you swap it back to
cachable memory if the CPU touches it?

One approach might be to mmap the uncachable ZONE_DEVICE memory and
mark it inaccessible to the CPU - DMA could still translate. If the
CPU needs it then the kernel migrates it to system memory so it
becomes cachable. ??

Jason


[PATCH 1/6] ext2: Return BH_New buffers for zeroed blocks

2016-11-24 Thread Jan Kara
So far we have not returned BH_New buffers from ext2_get_blocks() when
we allocated and zeroed out a block for a DAX inode, to avoid racy
zeroing in the DAX code. This zeroing is gone these days so we can
remove the workaround.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/ext2/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 046b642f3585..e626fe892c01 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -754,9 +754,8 @@ static int ext2_get_blocks(struct inode *inode,
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
}
-   } else {
-   *new = true;
}
+   *new = true;
 
ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
mutex_unlock(&ei->truncate_mutex);
-- 
2.6.6



[PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault

2016-11-24 Thread Jan Kara
Currently the ->iomap_begin() handler is called with the entry lock
held. If the filesystem holds any locks between ->iomap_begin() and
->iomap_end() (such as ext4, which will want to hold a transaction
open), this causes a lock inversion with iomap_apply() from the standard
IO path, which first calls ->iomap_begin() and only then calls the
->actor() callback that grabs entry locks for DAX.

Fix the problem by nesting the grabbing of the entry lock inside the
->iomap_begin() - ->iomap_end() pair.

Signed-off-by: Jan Kara 
---
 fs/dax.c | 120 ++-
 1 file changed, 65 insertions(+), 55 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 38f996976ebf..be39633d346e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1077,6 +1077,15 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
+static int dax_fault_return(int error)
+{
+   if (error == 0)
+   return VM_FAULT_NOPAGE;
+   if (error == -ENOMEM)
+   return VM_FAULT_OOM;
+   return VM_FAULT_SIGBUS;
+}
+
 /**
  * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
@@ -1109,12 +1118,6 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (pos >= i_size_read(inode))
return VM_FAULT_SIGBUS;
 
-   entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-   if (IS_ERR(entry)) {
-   error = PTR_ERR(entry);
-   goto out;
-   }
-
if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
flags |= IOMAP_WRITE;
 
@@ -1125,9 +1128,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 */
error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
if (error)
-   goto unlock_entry;
+   return dax_fault_return(error);
if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-   error = -EIO;   /* fs corruption? */
+   vmf_ret = dax_fault_return(-EIO);   /* fs corruption? */
+   goto finish_iomap;
+   }
+
+   entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+   if (IS_ERR(entry)) {
+   vmf_ret = dax_fault_return(PTR_ERR(entry));
goto finish_iomap;
}
 
@@ -1150,13 +1159,13 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
 
if (error)
-   goto finish_iomap;
+   goto error_unlock_entry;
 
__SetPageUptodate(vmf->cow_page);
vmf_ret = finish_fault(vmf);
if (!vmf_ret)
vmf_ret = VM_FAULT_DONE_COW;
-   goto finish_iomap;
+   goto unlock_entry;
}
 
switch (iomap.type) {
@@ -1168,12 +1177,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
error = dax_insert_mapping(mapping, iomap.bdev, sector,
PAGE_SIZE, &entry, vma, vmf);
+   /* -EBUSY is fine, somebody else faulted on the same PTE */
+   if (error == -EBUSY)
+   error = 0;
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!(vmf->flags & FAULT_FLAG_WRITE)) {
vmf_ret = dax_load_hole(mapping, &entry, vmf);
-   goto finish_iomap;
+   goto unlock_entry;
}
/*FALLTHRU*/
default:
@@ -1182,30 +1194,25 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
}
 
- finish_iomap:
-   if (ops->iomap_end) {
-   if (error || (vmf_ret & VM_FAULT_ERROR)) {
-   /* keep previous error */
-   ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
-   &iomap);
-   } else {
-   error = ops->iomap_end(inode, pos, PAGE_SIZE,
-   PAGE_SIZE, flags, &iomap);
-   }
-   }
+ error_unlock_entry:
+   vmf_ret = dax_fault_return(error) | major;
  unlock_entry:
put_locked_mapping_entry(mapping, vmf->pgoff, entry);
- out:
-   if (error == -ENOMEM)
-   return VM_FAULT_OOM | major;
-   /* -EBUSY is fine, somebody else faulted on the same PTE */
-   if (error < 0 && error != -EBUSY)
-   return VM_FAULT_SIGBUS | major;
-   if (vmf_ret) {
-   WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
-   return vmf_ret;
+ finish_iomap:
+   if (ops->iomap_end) {
+   int copied = PAGE_SIZE;
+
+   if (vmf_ret & VM_FAULT_ERROR)
+   copied = 0;
+   /*
+* The fault is done by now and there's no way 

[PATCH 3/6] dax: Avoid page invalidation races and unnecessary radix tree traversals

2016-11-24 Thread Jan Kara
Currently each filesystem (possibly through generic_file_direct_write()
or iomap_dax_rw()) takes care of invalidating page tables and evicting
hole pages from the radix tree when a write(2) to the file happens. This
invalidation is only necessary when there is some block allocation
resulting from the write(2). Furthermore, in its current place the
invalidation is racy wrt a page fault instantiating a hole page just
after we have invalidated it.

So perform the page invalidation inside dax_do_io(), where we can do it
only when really necessary, and after blocks have been allocated, so
nobody will be instantiating new hole pages anymore.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/dax.c | 28 +++-
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4534f0e232e9..ddf77ef2ca18 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -984,6 +984,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
return -EIO;
 
+   /*
+* Write can allocate block for an area which has a hole page mapped
+* into page tables. We have to tear down these mappings so that data
+* written by write(2) is visible in mmap.
+*/
+   if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+   invalidate_inode_pages2_range(inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ (end - 1) >> PAGE_SHIFT);
+   }
+
while (pos < end) {
unsigned offset = pos & (PAGE_SIZE - 1);
struct blk_dax_ctl dax = { 0 };
@@ -1042,23 +1053,6 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iov_iter_rw(iter) == WRITE)
flags |= IOMAP_WRITE;
 
-   /*
-* Yes, even DAX files can have page cache attached to them:  A zeroed
-* page is inserted into the pagecache when we have to serve a write
-* fault on a hole.  It should never be dirtied and can simply be
-* dropped from the pagecache once we get real data for the page.
-*
-* XXX: This is racy against mmap, and there's nothing we can do about
-* it. We'll eventually need to shift this down even further so that
-* we can check if we allocated blocks over a hole first.
-*/
-   if (mapping->nrpages) {
-   ret = invalidate_inode_pages2_range(mapping,
-   pos >> PAGE_SHIFT,
-   (pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
-   WARN_ON_ONCE(ret);
-   }
-
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, dax_iomap_actor);
-- 
2.6.6

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate

2016-11-24 Thread Jan Kara
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable, as we track cache dirtiness in these entries and when
they are evicted we may fail to flush caches even though it is
necessary. This can, for example, manifest when we write to the same
block both via mmap and via write(2) (to different offsets), and
fsync(2) then does not properly flush CPU caches because the
modification via write(2) was the last one.

Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.
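The core decision in __dax_invalidate_mapping_entry() can be condensed into a toy userspace predicate (a sketch, not kernel code; `TAG_DIRTY`/`TAG_TOWRITE` and `may_remove_entry` are made-up stand-ins for the kernel's radix tree tags):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a radix tree entry's tags; not the kernel's radix tree. */
#define TAG_DIRTY   0x1
#define TAG_TOWRITE 0x2

/*
 * Decision mirrored from __dax_invalidate_mapping_entry(): truncation
 * may always remove the entry; plain invalidation must keep dirty or
 * towrite entries so that a later fsync(2) still flushes CPU caches.
 */
static bool may_remove_entry(unsigned int tags, bool trunc)
{
	if (trunc)
		return true;
	return !(tags & (TAG_DIRTY | TAG_TOWRITE));
}
```

Truncation is allowed to discard dirtiness because the data is going away anyway; ordinary cache eviction is not.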

Signed-off-by: Jan Kara 
---
 fs/dax.c| 71 +
 include/linux/dax.h |  3 +++
 mm/truncate.c   | 71 -
 3 files changed, 123 insertions(+), 22 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cafd5597434b..4534f0e232e9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -452,16 +452,37 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
	__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
 }
 
+static int __dax_invalidate_mapping_entry(struct address_space *mapping,
+ pgoff_t index, bool trunc)
+{
+   int ret = 0;
+   void *entry;
+   struct radix_tree_root *page_tree = &mapping->page_tree;
+
+   spin_lock_irq(&mapping->tree_lock);
+   entry = get_unlocked_mapping_entry(mapping, index, NULL);
+   if (!entry || !radix_tree_exceptional_entry(entry))
+   goto out;
+   if (!trunc &&
+   (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
+   goto out;
+   radix_tree_delete(page_tree, index);
+   mapping->nrexceptional--;
+   ret = 1;
+out:
+   put_unlocked_mapping_entry(mapping, index, entry);
+   spin_unlock_irq(&mapping->tree_lock);
+   return ret;
+}
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-   void *entry;
+   int ret = __dax_invalidate_mapping_entry(mapping, index, true);
 
-   spin_lock_irq(&mapping->tree_lock);
-   entry = get_unlocked_mapping_entry(mapping, index, NULL);
/*
 * This gets called from truncate / punch_hole path. As such, the caller
 * must hold locks protecting against concurrent modifications of the
@@ -469,16 +490,46 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 * caller has seen exceptional entry for this index, we better find it
 * at that index as well...
 */
-   if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
-   spin_unlock_irq(&mapping->tree_lock);
-   return 0;
-   }
-   radix_tree_delete(&mapping->page_tree, index);
+   WARN_ON_ONCE(!ret);
+   return ret;
+}
+
+/*
+ * Invalidate exceptional DAX entry if easily possible. This handles DAX
+ * entries for invalidate_inode_pages() so we evict the entry only if we can
+ * do so without blocking.
+ */
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+   int ret = 0;
+   void *entry, **slot;
+   struct radix_tree_root *page_tree = &mapping->page_tree;
+
+   spin_lock_irq(&mapping->tree_lock);
+   entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
+   if (!entry || !radix_tree_exceptional_entry(entry) ||
+   slot_locked(mapping, slot))
+   goto out;
+   if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+   radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+   goto out;
+   radix_tree_delete(page_tree, index);
mapping->nrexceptional--;
+   ret = 1;
+out:
spin_unlock_irq(&mapping->tree_lock);
-   dax_wake_mapping_entry_waiter(mapping, index, entry, true);
+   if (ret)
+   dax_wake_mapping_entry_waiter(mapping, index, entry, true);
+   return ret;
+}
 
-   return 1;
+/*
+ * Invalidate exceptional DAX entry if it is clean.
+ */
+int dax_invalidate_clean_mapping_entry(struct address_space *mapping,
+  pgoff_t index)
+{
+   return __dax_invalidate_mapping_entry(mapping, index, false);
 }
 
 /*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f97bcfe79472..6e36b11285b0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,6 +41,9 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter 
*iter,
 int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
+int 

[PATCH 0/6 v2] dax: Page invalidation fixes

2016-11-24 Thread Jan Kara
Hello,

this is second revision of my fixes of races when invalidating hole pages in
DAX mappings. See changelogs for details. The series is based on my patches to
write-protect DAX PTEs which are currently carried in mm tree. This is a hard
dependency because we really need to closely track dirtiness (and cleanness!)
of radix tree entries in DAX mappings in order to avoid discarding valid dirty
bits leading to missed cache flushes on fsync(2).

The series has passed xfstests for xfs and ext4 in both DAX and non-DAX mode.

I'd like to get some review of the patches (MM/FS people, please check whether
you like the direction changes in mm/truncate.c take in patch 2/6 - added
Johannes to CC since he was touching related code recently) so that these
patches can land in some tree once DAX write-protection patches are merged.
I'm hoping to get at least first three patches merged for 4.10-rc2... Thanks!

Changes since v1:
* Rebased on top of patches in mm tree
* Added some Reviewed-by tags
* renamed some functions based on review feedback

Honza


[PATCH 4/6] dax: Finish fault completely when loading holes

2016-11-24 Thread Jan Kara
The only case where we do not finish the page fault completely is when
we are loading hole pages into the radix tree. Avoid this special case
and finish the fault inside the DAX fault handler in that case as well.
This will allow for easier iomap handling.
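The control flow the patch introduces in dax_load_hole() can be modelled as a toy userspace sketch (hypothetical names throughout; only the VM_FAULT_* spellings echo the kernel's): on success the handler takes a reference for the new PTE and reports NOPAGE instead of handing a locked page back to the caller.

```c
#include <assert.h>

/* Toy return codes echoing the kernel's fault result flags. */
#define TOY_VM_FAULT_NOPAGE 0x0100
#define TOY_VM_FAULT_OOM    0x0001

static int page_refs = 1;	/* ref held by the page cache in this model */

/* finish_fault() stand-in: 0 on success, an error flag otherwise. */
static int toy_finish_fault(int fail)
{
	return fail ? TOY_VM_FAULT_OOM : 0;
}

/*
 * After the patch, the hole-page case completes inside the DAX fault
 * handler: on success it grabs a reference for the PTE that now maps
 * the page and returns VM_FAULT_NOPAGE; on failure the error result is
 * propagated unchanged.
 */
static int toy_load_hole(int fail)
{
	int ret = toy_finish_fault(fail);

	if (!ret) {
		page_refs++;	/* reference now held by the PTE */
		return TOY_VM_FAULT_NOPAGE;
	}
	return ret;
}
```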

Signed-off-by: Jan Kara 
---
 fs/dax.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ddf77ef2ca18..38f996976ebf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -540,15 +540,16 @@ int dax_invalidate_clean_mapping_entry(struct address_space *mapping,
  * otherwise it will simply fall out of the page cache under memory
  * pressure without ever having been dirtied.
  */
-static int dax_load_hole(struct address_space *mapping, void *entry,
+static int dax_load_hole(struct address_space *mapping, void **entry,
 struct vm_fault *vmf)
 {
struct page *page;
+   int ret;
 
/* Hole page already exists? Return it...  */
-   if (!radix_tree_exceptional_entry(entry)) {
-   vmf->page = entry;
-   return VM_FAULT_LOCKED;
+   if (!radix_tree_exceptional_entry(*entry)) {
+   page = *entry;
+   goto out;
}
 
/* This will replace locked radix tree entry with a hole page */
@@ -556,8 +557,17 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
   vmf->gfp_mask | __GFP_ZERO);
if (!page)
return VM_FAULT_OOM;
+ out:
vmf->page = page;
-   return VM_FAULT_LOCKED;
+   ret = finish_fault(vmf);
+   vmf->page = NULL;
+   *entry = page;
+   if (!ret) {
+   /* Grab reference for PTE that is now referencing the page */
+   get_page(page);
+   return VM_FAULT_NOPAGE;
+   }
+   return ret;
 }
 
static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
@@ -1162,8 +1172,8 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!(vmf->flags & FAULT_FLAG_WRITE)) {
-   vmf_ret = dax_load_hole(mapping, entry, vmf);
-   break;
+   vmf_ret = dax_load_hole(mapping, &entry, vmf);
+   goto finish_iomap;
}
/*FALLTHRU*/
default:
@@ -1184,8 +1194,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
}
  unlock_entry:
-   if (vmf_ret != VM_FAULT_LOCKED || error)
-   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
if (error == -ENOMEM)
return VM_FAULT_OOM | major;
-- 
2.6.6



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Christian König

Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:

There is certainly nothing about the hardware that cares
about ZONE_DEVICE vs System memory.
Well, that is clearly not so simple. When your ZONE_DEVICE pages describe 
a PCI BAR and another PCI device initiates a DMA to this address, the DMA 
subsystem must be able to check whether the interconnection really works.


E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE. 
Now PCI device B (a SATA device) can directly read/write to it because 
it is on the same bus segment, but PCI device C (a network card for 
example) can't because it is on a different bus segment and the bridge 
can't handle P2P transactions.


We need to be able to handle such cases and fall back to bounce 
buffers, but I don't see that in the DMA subsystem right now.
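The fallback Christian asks for could look roughly like this toy userspace sketch (entirely hypothetical; the kernel has no such API today, and `bus_segment` stands in for a real topology walk): check whether the initiator can reach the target memory, and substitute a system-RAM bounce buffer when it can't.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch; no such kernel API exists as of this thread. */
enum mem_kind { SYSTEM_RAM, DEVICE_BAR };

struct toy_dev { int bus_segment; };
struct toy_buf { enum mem_kind kind; int bus_segment; };

/* DMA is directly possible to system RAM, or to a BAR on the same segment. */
static bool p2p_reachable(const struct toy_dev *initiator,
			  const struct toy_buf *target)
{
	if (target->kind == SYSTEM_RAM)
		return true;
	return initiator->bus_segment == target->bus_segment;
}

/* Fall back to a system-RAM bounce buffer when the bridge can't route P2P. */
static enum mem_kind pick_dma_target(const struct toy_dev *initiator,
				     const struct toy_buf *target)
{
	return p2p_reachable(initiator, target) ? target->kind : SYSTEM_RAM;
}
```

In Christian's example, the SATA device on the BAR's segment would DMA straight to the BAR, while the network card on another segment would be routed through the bounce buffer.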


Regards,
Christian.


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-24 Thread Jan Kara
On Wed 23-11-16 11:44:19, Ross Zwisler wrote:
> Tracepoints are the standard way to capture debugging and tracing
> information in many parts of the kernel, including the XFS and ext4
> filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> be done in the same way as the filesystem tracing so that developers can
> look at them together and get a coherent idea of what the system is doing.
> 
> I added both an entry and exit tracepoint because future patches will add
> tracepoints to child functions of dax_iomap_pmd_fault() like
> dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> be wrapped by the parent function tracepoints so the code flow is more
> easily understood.  Having entry and exit tracepoints for faults also
> allows us to easily see what filesystems functions were called during the
> fault.  These filesystem functions get executed via iomap_begin() and
> iomap_end() calls, for example, and will have their own tracepoints.
> 
> For PMD faults we primarily want to understand the faulting address and
> whether it fell back to 4k faults.  If it fell back to 4k faults the
> tracepoints should let us understand why.
> 
> I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> to have its own separate tracing header in the same directory at some
> point.
> 
> Here is an example output for these events from a successful PMD fault:
> 
> big-2057  [000]    136.396855: dax_pmd_fault: shared mapping write
> address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> max_pgoff 0x1400
> 
> big-2057  [000]    136.397943: dax_pmd_fault_done: shared mapping write
> address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> max_pgoff 0x1400 NOPAGE
> 
> Signed-off-by: Ross Zwisler 
> Suggested-by: Dave Chinner 

Looks good. Just one minor comment:

> + TP_printk("%s mapping %s address %#lx vm_start %#lx vm_end %#lx "
> + "pgoff %#lx max_pgoff %#lx %s",
> + __entry->vm_flags & VM_SHARED ? "shared" : "private",
> + __entry->flags & FAULT_FLAG_WRITE ? "write" : "read",
> + __entry->address,
> + __entry->vm_start,
> + __entry->vm_end,
> + __entry->pgoff,
> + __entry->max_pgoff,
> + __print_flags(__entry->result, "|", VM_FAULT_RESULT_TRACE)
> + )
> +)

I think it may be useful to dump full 'flags', not just FAULT_FLAG_WRITE...
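What dumping the full flags through __print_flags() would yield can be illustrated with a small userspace stand-in (the FAULT_FLAG_* values below are copied for illustration from that era's include/linux/mm.h; `decode_fault_flags` is a made-up helper, not a kernel function):

```c
#include <assert.h>
#include <string.h>

/* Flag values copied for illustration; see include/linux/mm.h upstream. */
#define FAULT_FLAG_WRITE	0x01
#define FAULT_FLAG_MKWRITE	0x02
#define FAULT_FLAG_ALLOW_RETRY	0x04

static const char *flag_names[] = { "WRITE", "MKWRITE", "ALLOW_RETRY" };

/* Userspace stand-in for __print_flags(): renders "WRITE|ALLOW_RETRY". */
static char *decode_fault_flags(unsigned int flags, char *buf, size_t len)
{
	buf[0] = '\0';
	for (unsigned int i = 0; i < 3; i++) {
		if (flags & (1u << i)) {
			if (buf[0])
				strncat(buf, "|", len - strlen(buf) - 1);
			strncat(buf, flag_names[i], len - strlen(buf) - 1);
		}
	}
	return buf;
}
```

With the full flags decoded this way, a trace line would show e.g. `WRITE|ALLOW_RETRY` instead of the bare read/write distinction.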

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/6] dax: remove leading space from labels

2016-11-24 Thread Jan Kara
On Wed 23-11-16 11:44:18, Ross Zwisler wrote:
> No functional change.
> 
> As of this commit:
> 
> commit 218dd85887da (".gitattributes: set git diff driver for C source code
> files")
> 
> git-diff and git-format-patch both generate diffs whose hunks are correctly
> prefixed by function names instead of labels, even if those labels aren't
> indented with spaces.

Fine by me. I just have some 4 remaining DAX patches (will send them out
today) and they will clash with this. So I'd prefer if this happened after
they are merged...

Honza
> 
> Signed-off-by: Ross Zwisler 
> ---
>  fs/dax.c | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d8fe3eb..cc8a069 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -422,7 +422,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>   return page;
>   }
>   entry = lock_slot(mapping, slot);
> - out_unlock:
> +out_unlock:
>   spin_unlock_irq(&mapping->tree_lock);
>   return entry;
>  }
> @@ -557,7 +557,7 @@ static int dax_load_hole(struct address_space *mapping, void **entry,
>  vmf->gfp_mask | __GFP_ZERO);
>   if (!page)
>   return VM_FAULT_OOM;
> - out:
> +out:
>   vmf->page = page;
>   ret = finish_fault(vmf);
>   vmf->page = NULL;
> @@ -659,7 +659,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>   }
>   if (vmf->flags & FAULT_FLAG_WRITE)
>   radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
> - unlock:
> +unlock:
>   spin_unlock_irq(&mapping->tree_lock);
>   if (hole_fill) {
>   radix_tree_preload_end();
> @@ -812,12 +812,12 @@ static int dax_writeback_one(struct block_device *bdev,
>   spin_lock_irq(&mapping->tree_lock);
>   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
>   spin_unlock_irq(&mapping->tree_lock);
> - unmap:
> +unmap:
>   dax_unmap_atomic(bdev, &dax);
>   put_locked_mapping_entry(mapping, index, entry);
>   return ret;
>  
> - put_unlocked:
> +put_unlocked:
>   put_unlocked_mapping_entry(mapping, index, entry2);
>   spin_unlock_irq(&mapping->tree_lock);
>   return ret;
> @@ -1193,11 +1193,11 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>   break;
>   }
>  
> - error_unlock_entry:
> +error_unlock_entry:
>   vmf_ret = dax_fault_return(error) | major;
> - unlock_entry:
> +unlock_entry:
>   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
> - finish_iomap:
> +finish_iomap:
>   if (ops->iomap_end) {
>   int copied = PAGE_SIZE;
>  
> @@ -1254,7 +1254,7 @@ static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
>  
>   return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
>  
> - unmap_fallback:
> +unmap_fallback:
>   dax_unmap_atomic(bdev, &dax);
>   return VM_FAULT_FALLBACK;
>  }
> @@ -1378,9 +1378,9 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>   break;
>   }
>  
> - unlock_entry:
> +unlock_entry:
>   put_locked_mapping_entry(mapping, pgoff, entry);
> - finish_iomap:
> +finish_iomap:
>   if (ops->iomap_end) {
>   int copied = PMD_SIZE;
>  
> @@ -1395,7 +1395,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>   ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
>   &iomap);
>   }
> - fallback:
> +fallback:
>   if (result == VM_FAULT_FALLBACK) {
>   split_huge_pmd(vma, pmd, address);
>   count_vm_event(THP_FAULT_FALLBACK);
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/6] dax: fix build breakage with ext4, dax and !iomap

2016-11-24 Thread Jan Kara
On Wed 23-11-16 11:44:17, Ross Zwisler wrote:
> With the current Kconfig setup it is possible to have the following:
> 
> CONFIG_EXT4_FS=y
> CONFIG_FS_DAX=y
> CONFIG_FS_IOMAP=n # this is in fs/Kconfig & isn't user accessible
> 
> With this config we get build failures in ext4_dax_fault() because the
> iomap functions in fs/dax.c are missing:
> 
> fs/built-in.o: In function `ext4_dax_fault':
> file.c:(.text+0x7f3ac): undefined reference to `dax_iomap_fault'
> file.c:(.text+0x7f404): undefined reference to `dax_iomap_fault'
> fs/built-in.o: In function `ext4_file_read_iter':
> file.c:(.text+0x7fc54): undefined reference to `dax_iomap_rw'
> fs/built-in.o: In function `ext4_file_write_iter':
> file.c:(.text+0x7fe9a): undefined reference to `dax_iomap_rw'
> file.c:(.text+0x7feed): undefined reference to `dax_iomap_rw'
> fs/built-in.o: In function `ext4_block_zero_page_range':
> inode.c:(.text+0x85c0d): undefined reference to `iomap_zero_range'
> 
> Now that the struct buffer_head based DAX fault paths and I/O path have
> been removed we really depend on iomap support being present for DAX.  Make
> this explicit by selecting FS_IOMAP if we compile in DAX support.
> 
> Signed-off-by: Ross Zwisler 

I've sent the same patch to Ted yesterday and he will probably queue it on
top of ext4 iomap patches. If it doesn't happen for some reason, feel free
to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/Kconfig  | 1 +
>  fs/dax.c| 2 --
>  fs/ext2/Kconfig | 1 -
>  3 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 8e9e5f41..18024bf 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -38,6 +38,7 @@ config FS_DAX
>   bool "Direct Access (DAX) support"
>   depends on MMU
>   depends on !(ARM || MIPS || SPARC)
> + select FS_IOMAP
>   help
> Direct Access (DAX) can be used on memory-backed block devices.
> If the block device supports DAX and the filesystem supports DAX,
> diff --git a/fs/dax.c b/fs/dax.c
> index be39633..d8fe3eb 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -968,7 +968,6 @@ int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
> -#ifdef CONFIG_FS_IOMAP
>  static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
>  {
>   return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> @@ -1405,4 +1404,3 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
>  #endif /* CONFIG_FS_DAX_PMD */
> -#endif /* CONFIG_FS_IOMAP */
> diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
> index 36bea5a..c634874e 100644
> --- a/fs/ext2/Kconfig
> +++ b/fs/ext2/Kconfig
> @@ -1,6 +1,5 @@
>  config EXT2_FS
>   tristate "Second extended fs support"
> - select FS_IOMAP if FS_DAX
>   help
> Ext2 is a standard Linux file system for hard disks.
>  
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR