Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 03:22:04PM -0800, Darrick J. Wong wrote:
> (Assuming there's no syncv involved here...?)

No.  While I think we could implement it for XFS, similar to how we
roll transactions over multiple inodes for a few operations, the use
case is much more limited, and the potential pitfalls are much bigger.

> > have to check the F_IOINFO fcntl first, which is a bit of a killer.
> > Because of that I've also not implemented any other validity checks
> > yet, as they might make things even worse when an open on an unsupported
> > file system or device fails, but not on an old kernel.  Maybe we need
> > a new version of open that checks its arguments properly first?
> 
> Does fcntl(F_SETFL...) suffer from this?

Yes.


Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 03:48:16PM -0500, Chris Mason wrote:
> One thing that isn't clear to me is how we're dealing with boundary bio 
> mappings, which will get submitted by submit_page_section()
>
> sdio->boundary = buffer_boundary(map_bh);

The old dio code is not supported at all by this code at the moment.
We'll either need the new block device direct I/O code on block
devices, limited to BIO_MAX_PAGES (this is a bug in this patchset if
people ever have devices with > 1MB atomic write support, and thanks
to NVMe the failure case is silent, sigh..), or we need file system
support for out of place writes.

>
> In btrfs I'd just chain things together and do the extent pointer swap
> afterwards, but I didn't follow the XFS code well enough to see how it's
> handled there.  But either way it feels like an error-prone surprise
> waiting to happen, and one gap we really want to get right in the FS
> support is O_ATOMIC across a fragmented extent.
>
> If I'm reading the XFS patches right, the code always COWs for atomic writes.

It doesn't really COW - it uses the COW infrastructure to write out of
place and then commit it into the file later.  Because of that we don't
really care about things like boundary blocks (which XFS never used in
that form anyway) - data is written first, the cache is flushed and then
we swap around the extent pointers.
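
In rough pseudo-kernel-C that sequence looks something like the sketch
below; the commit helper at the end is illustrative only, not an
actual XFS function:

    /*
     * Sketch of the O_ATOMIC commit path described above.  Data has
     * already been written to newly allocated, out of place blocks.
     */
    static int atomic_write_commit(struct inode *inode)
    {
        int error;

        /* Flush the device write cache so the new data is stable. */
        error = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
        if (error)
            return error;

        /* Swap the staged extents into the file in one transaction. */
        return swap_in_staged_extents(inode);    /* illustrative */
    }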

> Are 
> you planning on adding an optimization to use atomic support in the device 
> to skip COW when possible?

We could do that fairly easily for files that have a contiguous mapping
for the atomic write I/O.  But at this point I have a lot more trust in
the fs code than in the devices, especially due to the silent failure mode.


Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Christoph Hellwig
On Wed, Mar 01, 2017 at 01:21:41PM +0200, Amir Goldstein wrote:
> [CC += linux-...@vger.kernel.org] for that question and for the new API

We'll need to iterate over the API a few more times first, I think..


Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Amir Goldstein
On Tue, Feb 28, 2017 at 4:57 PM, Christoph Hellwig wrote:
> Hi all,
>
> this series implements a new O_ATOMIC flag for failure-atomic writes
> to files.  It is based on and tries to unify two earlier proposals,
> the first one for block devices by Chris Mason:
>
> https://lwn.net/Articles/573092/
>
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
>
> 
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
>
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is, either the whole write makes it to persistent
> storage, or none of it, even in case of power or other failures.
>
> There are two implementation variants of this: on block devices, O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
>
> The second case is for file systems, where we simply write new blocks
> out of place and then remap them into the file atomically, either on
> completion of an O_(D)SYNC write or when fsync is called explicitly.
>
> The semantics of the latter case are explained in detail in the Usenix
> paper above.
>
> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
>
> The implementation is simple and clean, but I'm rather unhappy about
> the interface, as it has too many failure modes to bulletproof.  For
> one, old kernels silently ignore unknown open flags, so applications
> have to check the F_IOINFO fcntl first, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make things even worse when an open on an unsupported
> file system or device fails, but not on an old kernel.  Maybe we need
> a new version of open that checks its arguments properly first?
>

[CC += linux-...@vger.kernel.org] for that question and for the new API

> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting.  This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees, it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
>
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:
>
> git://git.infradead.org/users/hch/vfs.git O_ATOMIC
>
> Gitweb:
>
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC


Re: [RFC] failure atomic writes for file systems and block devices

2017-02-28 Thread Chris Mason



On 02/28/2017 09:57 AM, Christoph Hellwig wrote:

> Hi all,
>
> this series implements a new O_ATOMIC flag for failure-atomic writes
> to files.  It is based on and tries to unify two earlier proposals,
> the first one for block devices by Chris Mason:
>
> https://lwn.net/Articles/573092/
>
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
>
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
>
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is, either the whole write makes it to persistent
> storage, or none of it, even in case of power or other failures.
>
> There are two implementation variants of this: on block devices, O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.



Hi Christoph,

This is great, and the supporting code in both dio and bio gets rid of
some of the warts from when I tried.  The DIO_PAGES define used to be
an upper limit on the max contiguous bio that would get built, but
that's much better now.


One thing that isn't clear to me is how we're dealing with boundary bio 
mappings, which will get submitted by submit_page_section()


sdio->boundary = buffer_boundary(map_bh);

In btrfs I'd just chain things together and do the extent pointer swap
afterwards, but I didn't follow the XFS code well enough to see how it's
handled there.  But either way it feels like an error-prone surprise
waiting to happen, and one gap we really want to get right in the FS
support is O_ATOMIC across a fragmented extent.


If I'm reading the XFS patches right, the code always COWs for atomic
writes.  Are you planning on adding an optimization to use atomic
support in the device to skip the COW when possible?


To turn off MySQL double buffering, we only need 16K or 64K writes,
which most of the time you'd be able to pass down directly without a COW.


-chris


Re: [RFC] failure atomic writes for file systems and block devices

2017-02-28 Thread Darrick J. Wong
On Tue, Feb 28, 2017 at 06:57:25AM -0800, Christoph Hellwig wrote:
> Hi all,
> 
> this series implements a new O_ATOMIC flag for failure-atomic writes
> to files.  It is based on and tries to unify two earlier proposals,
> the first one for block devices by Chris Mason:
> 
>   https://lwn.net/Articles/573092/
> 
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
> 
>   
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
> 
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is, either the whole write makes it to persistent
> storage, or none of it, even in case of power or other failures.
> 
> There are two implementation variants of this: on block devices, O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
> 
> The second case is for file systems, where we simply write new blocks
> out of place and then remap them into the file atomically, either on
> completion of an O_(D)SYNC write or when fsync is called explicitly.
> 
> The semantics of the latter case are explained in detail in the Usenix
> paper above.

(Assuming there's no syncv involved here...?)

> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
> 
> The implementation is simple and clean, but I'm rather unhappy about
> the interface, as it has too many failure modes to bulletproof.  For
> one, old kernels silently ignore unknown open flags, so applications

Ok, heh, disregard my review comment (for the xfs part) about the
seemingly insufficient O_ATOMIC validation.

> have to check the F_IOINFO fcntl first, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make things even worse when an open on an unsupported
> file system or device fails, but not on an old kernel.  Maybe we need
> a new version of open that checks its arguments properly first?

Does fcntl(F_SETFL...) suffer from this?

> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting.

Yikes!

> This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees, it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
> 
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:

Well, the XFS parts don't look too bad.

--D

> 
> git://git.infradead.org/users/hch/vfs.git O_ATOMIC
> 
> Gitweb:
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC


[RFC] failure atomic writes for file systems and block devices

2017-02-28 Thread Christoph Hellwig
Hi all,

this series implements a new O_ATOMIC flag for failure-atomic writes
to files.  It is based on and tries to unify two earlier proposals,
the first one for block devices by Chris Mason:

https://lwn.net/Articles/573092/

and the second one for regular files, published by HP Research at
Usenix FAST 2015:


https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma

It adds a new O_ATOMIC flag for open, which requests writes to be
failure-atomic, that is, either the whole write makes it to persistent
storage, or none of it, even in case of power or other failures.

There are two implementation variants of this: on block devices, O_ATOMIC
must be combined with O_(D)SYNC so that storage devices that can handle
large writes atomically can simply do that without any additional work.
This case is supported by NVMe.
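
Usage for the block device case would look roughly like the sketch
below; note that the O_ATOMIC value here is purely illustrative, since
the flag is not in any released header:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef O_ATOMIC
    #define O_ATOMIC 040000000    /* illustrative value only */
    #endif

    int main(void)
    {
        static char buf[4096] __attribute__((aligned(4096)));
        int fd;

        /* On block devices O_ATOMIC must be paired with O_(D)SYNC. */
        fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT | O_DSYNC | O_ATOMIC);
        if (fd < 0)
            return 1;

        memset(buf, 'x', sizeof(buf));
        /* Either all 4096 bytes reach stable storage, or none do. */
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
            return 1;
        return close(fd) ? 1 : 0;
    }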

The second case is for file systems, where we simply write new blocks
out of place and then remap them into the file atomically, either on
completion of an O_(D)SYNC write or when fsync is called explicitly.

The semantics of the latter case are explained in detail in the Usenix
paper above.

Last but not least a new fcntl is implemented to provide information
about I/O restrictions such as alignment requirements and the maximum
atomic write size.

The implementation is simple and clean, but I'm rather unhappy about
the interface, as it has too many failure modes to bulletproof.  For
one, old kernels silently ignore unknown open flags, so applications
have to check the F_IOINFO fcntl first, which is a bit of a killer.
Because of that I've also not implemented any other validity checks
yet, as they might make things even worse when an open on an unsupported
file system or device fails, but not on an old kernel.  Maybe we need
a new version of open that checks its arguments properly first?
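
To illustrate the dance applications would have to do, here is a
sketch of the probe; the F_IOINFO value and the ioinfo layout below
are assumptions for illustration, not the actual definitions from
this series:

    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdint.h>

    #ifndef F_IOINFO
    #define F_IOINFO 1037            /* hypothetical value */
    #endif

    struct ioinfo {                  /* assumed layout */
        uint32_t ioi_align;          /* required I/O alignment */
        uint32_t ioi_pad;
        uint64_t ioi_max_atomic;     /* maximum atomic write size */
    };

    static bool atomic_writes_supported(int fd)
    {
        struct ioinfo ioi;

        /*
         * Old kernels silently drop O_ATOMIC, but they also reject
         * an unknown fcntl, so a failing F_IOINFO tells us not to
         * rely on the flag.
         */
        if (fcntl(fd, F_IOINFO, &ioi) < 0)
            return false;
        return ioi.ioi_max_atomic > 0;
    }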

Also I'm really worried about the NVMe failure modes - devices simply
advertise an atomic write size, with no way for the device to know
that the host requested a given write to be atomic, and thus no
error reporting.  This is made worse by NVMe 1.2 adding per-namespace
atomic I/O parameters that devices can use to introduce additional
odd alignment quirks - while there is some language in the spec
requiring them not to weaken the per-controller guarantees, it all
looks rather weak and I'm not too confident in all implementations
getting everything right.
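
For reference, the limit a driver can derive comes only from the
static identify data.  Roughly (AWUPF is the spec's 0's based atomic
write unit for power fail; the helper itself is just a sketch):

    /*
     * Nothing in the write command itself marks it atomic, so writes
     * exceeding this limit simply lose the guarantee, silently.
     */
    static u64 nvme_max_atomic_write_bytes(struct nvme_id_ctrl *id,
                                           u8 lba_shift)
    {
        u16 awupf = le16_to_cpu(id->awupf);    /* 0's based block count */

        return ((u64)awupf + 1) << lba_shift;
    }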

Last but not least this depends on a few XFS patches, so to actually
apply / run the patches please use this git tree:

git://git.infradead.org/users/hch/vfs.git O_ATOMIC

Gitweb:

http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC