Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-14 Thread Andy Lutomirski
On Sep 13, 2015 4:25 PM, "Dave Chinner" wrote:
>
> On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> > Can we have a clean way to figure out whether two file ranges are the
> > same in a way that allows false negatives?  I.e. return 1 if the
> > ranges are reflinks of each other and 0 if not?  Pretty please?  I've
> > implemented that in the past on btrfs by syncing the ranges and then
> > comparing FIEMAP output, but that's hideous.
>
> That fundamentally doesn't work for userspace, because the moment
> the filesystem drops its locks on the inodes in the kernel after
> doing the comparison the mappings can change.  IOWs, by the time the
> information gets back to userspace, it's already wrong. e.g. cp made
> this mistake by trying to use FIEMAP to optimise hole detection in
> files and ended up with corrupt copies.
>
> It really doesn't matter what the syscall/ioctl interface is, trying
> to make application logic decisions based on inode block mappings
> from userspace is racy and not safe and will go wrong...
>

I agree, and that thing was just an experiment.  I'd love to see a
sane and correct interface, though.


--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-13 Thread Dave Chinner
On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> Can we have a clean way to figure out whether two file ranges are the
> same in a way that allows false negatives?  I.e. return 1 if the
> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
> implemented that in the past on btrfs by syncing the ranges and then
> comparing FIEMAP output, but that's hideous.

That fundamentally doesn't work for userspace, because the moment
the filesystem drops its locks on the inodes in the kernel after
doing the comparison the mappings can change.  IOWs, by the time the
information gets back to userspace, it's already wrong. e.g. cp made
this mistake by trying to use FIEMAP to optimise hole detection in
files and ended up with corrupt copies.

It really doesn't matter what the syscall/ioctl interface is, trying
to make application logic decisions based on inode block mappings
from userspace is racy and not safe and will go wrong...
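
To illustrate the hole-detection case mentioned above, here is a minimal Python sketch (not code from cp or from this thread) that walks a file's data segments with lseek(SEEK_DATA/SEEK_HOLE) instead of FIEMAP. The helper name data_segments is hypothetical. The result is still only a snapshot: if another process writes to the file during or after the walk, the list is stale, which is exactly the race described above.

```python
import errno
import os


def data_segments(fd):
    """Return a list of (start, end) byte ranges that the kernel reports as data.

    Hypothetical helper for illustration. It moves the file offset of fd,
    and the answer can be out of date as soon as it is returned.
    """
    size = os.fstat(fd).st_size
    segments = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                break  # nothing but a hole between offset and EOF
            raise
        # There is always an implicit hole at EOF, so this call succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        segments.append((start, end))
        offset = end
    return segments
```

On a filesystem that does not track holes, the whole file is typically reported as a single data segment, so callers should not assume that holes will be found.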

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-10 Thread Austin S Hemmelgarn

On 2015-09-09 14:52, Anna Schumaker wrote:

On 09/08/2015 06:39 PM, Darrick J. Wong wrote:

On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:

On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong wrote:

On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:

On 08/09/15 20:10, Andy Lutomirski wrote:

On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker wrote:

On 09/08/2015 11:21 AM, Pádraig Brady wrote:

I see copy_file_range() is a reflink() on BTRFS?
That's a bit surprising, as it avoids the copy completely.
cp(1) for example considered doing a BTRFS clone by default,
but didn't due to expectations that users actually wanted
the data duplicated on disk for resilience reasons,
and for performance reasons so that write latencies were
restricted to the copy operation, rather than being
introduced at usage time as the dest file is CoW'd.

If reflink() is a possibility for copy_file_range()
then could it be done optionally with a flag?


The idea is that filesystems get to choose how to handle copies in the
default case.  BTRFS could do a reflink, but NFS could do a server side


Eww, different default behaviors depending on the filesystem. :)


copy instead.  I can change the default behavior to only do a data copy
(unless the reflink flag is specified) instead, if that is desirable.

What does everybody think?


I think the best you could do is to have a hint asking politely for
the data to be deep-copied.  After all, some filesystems reserve the
right to transparently deduplicate.

Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
advantage to deep copying unless you actually want two copies for
locality reasons.


Agreed. The reflink and server side copy are separate things.
There's no advantage to not doing a server side copy,
but as mentioned there may be advantages to doing deep copies on BTRFS
(another reason, not previously mentioned in this thread, would be
to avoid ENOSPC errors at some time in the future).

So having control over the deep copy seems useful.
It's debatable whether ALLOW_REFLINK should be on/off by default
for copy_file_range().  I'd be inclined to have such a setting off by default,
but cp(1) at least will work with whatever is chosen.


So far it looks like people are interested in at least these "make data appear
in this other place" filesystem operations:

1. reflink
2. reflink, but only if the contents are the same (dedupe)


What I meant by this was: if you ask for "regular copy", you may end
up with a reflink anyway.  Anyway, how can you reflink a range and
have the contents *not* be the same?


reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
match before, they will afterwards.

dedupe remaps fd_dest's range to fd_src's range only if they match, of course.

Perhaps I should have said "...if the contents are the same before the call"?




3. regular copy
4. regular copy, but make the hardware do it for us
5. regular copy, but require a second copy on the media (no-dedupe)


If this comes from me, I have no desire to ever use this as a flag.


I meant (5) as a "disable auto-dedupe for this operation" flag, not as
a "reallocate all the shared blocks now" op...


If someone wants to use chattr or some new operation to say "make this
range of this file belong just to me for purpose of optimizing future
writes", then sure, go for it, with the understanding that there are
plenty of filesystems for which that doesn't even make sense.


"Unshare these blocks" sounds more like something fallocate could do.

So far in my XFS reflink playground, it seems that using the defrag tool to
un-cow a file makes most sense.  AFAICT the XFS and ext4 defraggers copy a
fragmented file's data to a second file and use a 'swap extents' operation,
after which the donor file is unlinked.

Hey, if this syscall turns into a more generic "do something involving two
(fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
extents" as a 7th operation, to refactor the ioctls.  




6. regular copy, but don't CoW (eatmyothercopies) (joke)

(Please add whatever ops I missed.)
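
The difference between operations 1 to 3 in the list above can be sketched with a toy in-memory model. This is purely illustrative Python, not kernel or filesystem code: the class ToyFS and all of its method names are hypothetical, a "file" is just a list of block ids, and two ranges are "shared" when they point at the same ids.

```python
class ToyFS:
    """Toy model: blocks hold bytes, files are lists of block ids."""

    def __init__(self):
        self.blocks = {}
        self.files = {}
        self._next_id = 0

    def _alloc(self, data):
        bid = self._next_id
        self._next_id += 1
        self.blocks[bid] = data
        return bid

    def create(self, name, chunks):
        self.files[name] = [self._alloc(c) for c in chunks]

    def read(self, name):
        return [self.blocks[b] for b in self.files[name]]

    def reflink(self, src, dst, start, count):
        # Op 1: remap dst onto src's blocks unconditionally.
        self.files[dst][start:start + count] = self.files[src][start:start + count]

    def dedupe(self, src, dst, start, count):
        # Op 2: remap only if the contents already match before the call.
        s = self.files[src][start:start + count]
        d = self.files[dst][start:start + count]
        if [self.blocks[b] for b in s] != [self.blocks[b] for b in d]:
            return False
        self.files[dst][start:start + count] = s
        return True

    def copy(self, src, dst, start, count):
        # Op 3: regular copy, dst gets freshly allocated blocks.
        src_ids = self.files[src][start:start + count]
        self.files[dst][start:start + count] = [
            self._alloc(self.blocks[b]) for b in src_ids
        ]

    def shared(self, a, b, start, count):
        # True when both ranges point at the very same blocks.
        return self.files[a][start:start + count] == self.files[b][start:start + count]
```

In this model a reflink changes the destination's contents if they differed, a dedupe never does, and a copy leaves the contents equal but the blocks unshared.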

I think I can see a case for letting (4) fall back to (3) since (4) is an
optimization of (3).

However, I particularly don't like the idea of (1) falling back to (3-5).
Either the kernel can satisfy a request or it can't, but let's not just
assume that we should transmogrify one type of request into another.  Userspace
should decide if a reflink failure should turn into one of the copy variants,
depending on whether the user wants to spread allocation costs over rewrites or
pay it all up front.  Also, if we allow reflink to fall back to copy, how do
programs find out what actually took place?  Or do we simply not allow them to
find out?
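
A minimal sketch of the "userspace decides" approach argued for above, assuming Linux and Python. The function name clone_or_copy and its parameter are hypothetical. The ioctl number is the one btrfs exposes as BTRFS_IOC_CLONE (later kernels also call it FICLONE); the constant below assumes the common _IOW encoding used on x86 and ARM. The caller chooses whether a failed reflink may become a plain copy, and the return value reports what actually took place.

```python
import fcntl
import shutil

# _IOW(0x94, 9, int); assumption: x86/ARM ioctl encoding.
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path, allow_copy_fallback):
    """Try a whole-file reflink; optionally fall back to a byte copy.

    Returns "reflink" or "copy" so the caller knows which one happened.
    Raises OSError if the reflink fails and no fallback was requested.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            if not allow_copy_fallback:
                raise
        # The fallback is an explicit userspace decision, not a kernel one.
        shutil.copyfileobj(src, dst)
        return "copy"
```

A caller that wants to pay allocation costs up front could skip the ioctl entirely and always copy; one that needs a fast, bounded operation would pass allow_copy_fallback=False and handle the error.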

Also, programs that expect reflink either to finish or fail quickly might be
surprised if it's possible for reflink to take a longer time than usual and
with the side effect 

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-10 Thread Anna Schumaker
On 09/09/2015 05:16 PM, Darrick J. Wong wrote:
> On Wed, Sep 09, 2015 at 02:52:08PM -0400, Anna Schumaker wrote:
>> On 09/08/2015 06:39 PM, Darrick J. Wong wrote:
>>> On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
 On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong wrote:
> On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
>> On 08/09/15 20:10, Andy Lutomirski wrote:
>>> On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker wrote:
 On 09/08/2015 11:21 AM, Pádraig Brady wrote:
> I see copy_file_range() is a reflink() on BTRFS?
> That's a bit surprising, as it avoids the copy completely.
> cp(1) for example considered doing a BTRFS clone by default,
> but didn't due to expectations that users actually wanted
> the data duplicated on disk for resilience reasons,
> and for performance reasons so that write latencies were
> restricted to the copy operation, rather than being
> introduced at usage time as the dest file is CoW'd.
>
> If reflink() is a possibility for copy_file_range()
> then could it be done optionally with a flag?

 The idea is that filesystems get to choose how to handle copies in the
 default case.  BTRFS could do a reflink, but NFS could do a server side
>
> Eww, different default behaviors depending on the filesystem. :)
>
 copy instead.  I can change the default behavior to only do a data copy
 (unless the reflink flag is specified) instead, if that is desirable.

 What does everybody think?
>>>
>>> I think the best you could do is to have a hint asking politely for
>>> the data to be deep-copied.  After all, some filesystems reserve the
>>> right to transparently deduplicate.
>>>
>>> Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
>>> advantage to deep copying unless you actually want two copies for
>>> locality reasons.
>>
>> Agreed. The reflink and server side copy are separate things.
>> There's no advantage to not doing a server side copy,
>> but as mentioned there may be advantages to doing deep copies on BTRFS
>> (another reason, not previously mentioned in this thread, would be
>> to avoid ENOSPC errors at some time in the future).
>>
>> So having control over the deep copy seems useful.
>> It's debatable whether ALLOW_REFLINK should be on/off by default
>> for copy_file_range().  I'd be inclined to have such a setting off by default,
>> but cp(1) at least will work with whatever is chosen.
>
> So far it looks like people are interested in at least these "make data appear
> in this other place" filesystem operations:
>
> 1. reflink
> 2. reflink, but only if the contents are the same (dedupe)

 What I meant by this was: if you ask for "regular copy", you may end
 up with a reflink anyway.  Anyway, how can you reflink a range and
 have the contents *not* be the same?
>>>
>>> reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
>>> match before, they will afterwards.
>>>
>>> dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
>>>
>>> Perhaps I should have said "...if the contents are the same before the call"?
>>>

> 3. regular copy
> 4. regular copy, but make the hardware do it for us
> 5. regular copy, but require a second copy on the media (no-dedupe)

 If this comes from me, I have no desire to ever use this as a flag.
>>>
>>> I meant (5) as a "disable auto-dedupe for this operation" flag, not as
>>> a "reallocate all the shared blocks now" op...
>>>
 If someone wants to use chattr or some new operation to say "make this
 range of this file belong just to me for purpose of optimizing future
 writes", then sure, go for it, with the understanding that there are
 plenty of filesystems for which that doesn't even make sense.
>>>
>>> "Unshare these blocks" sounds more like something fallocate could do.
>>>
>>> So far in my XFS reflink playground, it seems that using the defrag tool to
>>> un-cow a file makes most sense.  AFAICT the XFS and ext4 defraggers copy a
>>> fragmented file's data to a second file and use a 'swap extents' operation,
>>> after which the donor file is unlinked.
>>>
>>> Hey, if this syscall turns into a more generic "do something involving two
>>> (fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
>>> extents" as a 7th operation, to refactor the ioctls.  
>>>

> 6. regular copy, but don't CoW (eatmyothercopies) (joke)
>
> (Please add whatever ops I missed.)
>
> I think I can see a case for letting (4) fall back to (3) since (4) is an
> optimization of (3).
>
> However, I particularly don't like the 

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-10 Thread Austin S Hemmelgarn

On 2015-09-10 11:10, Anna Schumaker wrote:

On 09/09/2015 05:16 PM, Darrick J. Wong wrote:

On Wed, Sep 09, 2015 at 02:52:08PM -0400, Anna Schumaker wrote:

On 09/08/2015 06:39 PM, Darrick J. Wong wrote:

On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:

On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong wrote:

On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:

On 08/09/15 20:10, Andy Lutomirski wrote:

On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker wrote:

On 09/08/2015 11:21 AM, Pádraig Brady wrote:

I see copy_file_range() is a reflink() on BTRFS?
That's a bit surprising, as it avoids the copy completely.
cp(1) for example considered doing a BTRFS clone by default,
but didn't due to expectations that users actually wanted
the data duplicated on disk for resilience reasons,
and for performance reasons so that write latencies were
restricted to the copy operation, rather than being
introduced at usage time as the dest file is CoW'd.

If reflink() is a possibility for copy_file_range()
then could it be done optionally with a flag?


The idea is that filesystems get to choose how to handle copies in the
default case.  BTRFS could do a reflink, but NFS could do a server side


Eww, different default behaviors depending on the filesystem. :)


copy instead.  I can change the default behavior to only do a data copy
(unless the reflink flag is specified) instead, if that is desirable.

What does everybody think?


I think the best you could do is to have a hint asking politely for
the data to be deep-copied.  After all, some filesystems reserve the
right to transparently deduplicate.

Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
advantage to deep copying unless you actually want two copies for
locality reasons.


Agreed. The reflink and server side copy are separate things.
There's no advantage to not doing a server side copy,
but as mentioned there may be advantages to doing deep copies on BTRFS
(another reason, not previously mentioned in this thread, would be
to avoid ENOSPC errors at some time in the future).

So having control over the deep copy seems useful.
It's debatable whether ALLOW_REFLINK should be on/off by default
for copy_file_range().  I'd be inclined to have such a setting off by default,
but cp(1) at least will work with whatever is chosen.


So far it looks like people are interested in at least these "make data appear
in this other place" filesystem operations:

1. reflink
2. reflink, but only if the contents are the same (dedupe)


What I meant by this was: if you ask for "regular copy", you may end
up with a reflink anyway.  Anyway, how can you reflink a range and
have the contents *not* be the same?


reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
match before, they will afterwards.

dedupe remaps fd_dest's range to fd_src's range only if they match, of course.

Perhaps I should have said "...if the contents are the same before the call"?




3. regular copy
4. regular copy, but make the hardware do it for us
5. regular copy, but require a second copy on the media (no-dedupe)


If this comes from me, I have no desire to ever use this as a flag.


I meant (5) as a "disable auto-dedupe for this operation" flag, not as
a "reallocate all the shared blocks now" op...


If someone wants to use chattr or some new operation to say "make this
range of this file belong just to me for purpose of optimizing future
writes", then sure, go for it, with the understanding that there are
plenty of filesystems for which that doesn't even make sense.


"Unshare these blocks" sounds more like something fallocate could do.

So far in my XFS reflink playground, it seems that using the defrag tool to
un-cow a file makes most sense.  AFAICT the XFS and ext4 defraggers copy a
fragmented file's data to a second file and use a 'swap extents' operation,
after which the donor file is unlinked.

Hey, if this syscall turns into a more generic "do something involving two
(fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
extents" as a 7th operation, to refactor the ioctls.  




6. regular copy, but don't CoW (eatmyothercopies) (joke)

(Please add whatever ops I missed.)

I think I can see a case for letting (4) fall back to (3) since (4) is an
optimization of (3).

However, I particularly don't like the idea of (1) falling back to (3-5).
Either the kernel can satisfy a request or it can't, but let's not just
assume that we should transmogrify one type of request into another.  Userspace
should decide if a reflink failure should turn into one of the copy variants,
depending on whether the user wants to spread allocation costs over rewrites or
pay it all up front.  Also, if we allow reflink to fall back to copy, how do
programs find out what actually took place?  Or do we simply not allow them to
find out?

Also, programs that expect reflink either to finish or fail 

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Anna Schumaker
On 09/08/2015 06:39 PM, Darrick J. Wong wrote:
> On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
>> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong wrote:
>>> On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
 On 08/09/15 20:10, Andy Lutomirski wrote:
> On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker wrote:
>> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
>>> I see copy_file_range() is a reflink() on BTRFS?
>>> That's a bit surprising, as it avoids the copy completely.
>>> cp(1) for example considered doing a BTRFS clone by default,
>>> but didn't due to expectations that users actually wanted
>>> the data duplicated on disk for resilience reasons,
>>> and for performance reasons so that write latencies were
>>> restricted to the copy operation, rather than being
>>> introduced at usage time as the dest file is CoW'd.
>>>
>>> If reflink() is a possibility for copy_file_range()
>>> then could it be done optionally with a flag?
>>
>> The idea is that filesystems get to choose how to handle copies in the
>> default case.  BTRFS could do a reflink, but NFS could do a server side
>>>
>>> Eww, different default behaviors depending on the filesystem. :)
>>>
>> copy instead.  I can change the default behavior to only do a data copy
>> (unless the reflink flag is specified) instead, if that is desirable.
>>
>> What does everybody think?
>
> I think the best you could do is to have a hint asking politely for
> the data to be deep-copied.  After all, some filesystems reserve the
> right to transparently deduplicate.
>
> Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
> advantage to deep copying unless you actually want two copies for
> locality reasons.

 Agreed. The reflink and server side copy are separate things.
 There's no advantage to not doing a server side copy,
 but as mentioned there may be advantages to doing deep copies on BTRFS
 (another reason, not previously mentioned in this thread, would be
 to avoid ENOSPC errors at some time in the future).

 So having control over the deep copy seems useful.
 It's debatable whether ALLOW_REFLINK should be on/off by default
 for copy_file_range().  I'd be inclined to have such a setting off by default,
 but cp(1) at least will work with whatever is chosen.
>>>
>>> So far it looks like people are interested in at least these "make data appear
>>> in this other place" filesystem operations:
>>>
>>> 1. reflink
>>> 2. reflink, but only if the contents are the same (dedupe)
>>
>> What I meant by this was: if you ask for "regular copy", you may end
>> up with a reflink anyway.  Anyway, how can you reflink a range and
>> have the contents *not* be the same?
> 
> reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> match before, they will afterwards.
> 
> dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
> 
> Perhaps I should have said "...if the contents are the same before the call"?
> 
>>
>>> 3. regular copy
>>> 4. regular copy, but make the hardware do it for us
>>> 5. regular copy, but require a second copy on the media (no-dedupe)
>>
>> If this comes from me, I have no desire to ever use this as a flag.
> 
> I meant (5) as a "disable auto-dedupe for this operation" flag, not as
> a "reallocate all the shared blocks now" op...
> 
>> If someone wants to use chattr or some new operation to say "make this
>> range of this file belong just to me for purpose of optimizing future
>> writes", then sure, go for it, with the understanding that there are
>> plenty of filesystems for which that doesn't even make sense.
> 
> "Unshare these blocks" sounds more like something fallocate could do.
> 
> So far in my XFS reflink playground, it seems that using the defrag tool to
> un-cow a file makes most sense.  AFAICT the XFS and ext4 defraggers copy a
> fragmented file's data to a second file and use a 'swap extents' operation,
> after which the donor file is unlinked.
> 
> Hey, if this syscall turns into a more generic "do something involving two
> (fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
> extents" as a 7th operation, to refactor the ioctls.  
> 
>>
>>> 6. regular copy, but don't CoW (eatmyothercopies) (joke)
>>>
>>> (Please add whatever ops I missed.)
>>>
>>> I think I can see a case for letting (4) fall back to (3) since (4) is an
>>> optimization of (3).
>>>
>>> However, I particularly don't like the idea of (1) falling back to (3-5).
>>> Either the kernel can satisfy a request or it can't, but let's not just
>>> assume that we should transmogrify one type of request into another.  Userspace
>>> should decide if a reflink failure should turn into one of the copy variants,
>>> depending on whether the user 

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Trond Myklebust
On Wed, Sep 9, 2015 at 4:09 PM, Chris Mason wrote:
> On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
>> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote:
>> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
>> >> What I meant by this was: if you ask for "regular copy", you may end
>> >> up with a reflink anyway.  Anyway, how can you reflink a range and
>> >> have the contents *not* be the same?
>> >
>> > reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
>> > match before, they will afterwards.
>> >
>> > dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
>> >
>> > Perhaps I should have said "...if the contents are the same before the call"?
>> >
>>
>> Oh, I see.
>>
>> Can we have a clean way to figure out whether two file ranges are the
>> same in a way that allows false negatives?  I.e. return 1 if the
>> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
>> implemented that in the past on btrfs by syncing the ranges and then
>> comparing FIEMAP output, but that's hideous.
>
> I'd almost rather have a separate call, maybe unshare_file_range()?
>

Doesn't it make more sense to put that functionality in fallocate()?

Cheers
  Trond


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Andy Lutomirski
On Wed, Sep 9, 2015 at 1:09 PM, Chris Mason wrote:
> On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
>> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote:
>> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
>> >> What I meant by this was: if you ask for "regular copy", you may end
>> >> up with a reflink anyway.  Anyway, how can you reflink a range and
>> >> have the contents *not* be the same?
>> >
>> > reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
>> > match before, they will afterwards.
>> >
>> > dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
>> >
>> > Perhaps I should have said "...if the contents are the same before the call"?
>> >
>>
>> Oh, I see.
>>
>> Can we have a clean way to figure out whether two file ranges are the
>> same in a way that allows false negatives?  I.e. return 1 if the
>> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
>> implemented that in the past on btrfs by syncing the ranges and then
>> comparing FIEMAP output, but that's hideous.
>
> I'd almost rather have a separate call, maybe unshare_file_range()?
>
> Is that the end goal to the sharing check?

My use case was archival.  I can reflink data between a working copy
and some archived copy and then I can very efficiently tell if the
working copy has been changed by checking if the reflink is still
linked.

It would be even better if I could enumerate which parts of one file
match which parts of another file.
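
For comparison, the slow userspace way to enumerate matching parts is a plain byte comparison. The sketch below is hypothetical Python (the name matching_ranges and the chunk size are assumptions, not an existing interface). It reports equal *contents* at equal offsets, not shared extents, it reads both files in full, and like any userspace check it is only valid for the moment it ran.

```python
def matching_ranges(path_a, path_b, chunk=64 * 1024):
    """Return (start, end) byte ranges where both files hold identical data.

    Compares chunk-sized pieces at the same offsets; adjacent matching
    chunks are merged into one range. Granularity is one chunk.
    """
    ranges = []
    offset = 0
    run_start = None
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            ca = a.read(chunk)
            cb = b.read(chunk)
            if not ca or not cb:
                break  # one of the files ended
            if ca == cb:
                if run_start is None:
                    run_start = offset
            elif run_start is not None:
                ranges.append((run_start, offset))
                run_start = None
            offset += min(len(ca), len(cb))
    if run_start is not None:
        ranges.append((run_start, offset))
    return ranges
```

A kernel-side sharing check would avoid reading the data at all, which is what makes the reflink-based approach attractive for the archival case.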

--Andy


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Chris Mason
On Wed, Sep 09, 2015 at 04:26:58PM -0400, Trond Myklebust wrote:
> On Wed, Sep 9, 2015 at 4:09 PM, Chris Mason wrote:
> > On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> >> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote:
> >> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> >> >> What I meant by this was: if you ask for "regular copy", you may end
> >> >> up with a reflink anyway.  Anyway, how can you reflink a range and
> >> >> have the contents *not* be the same?
> >> >
> >> > reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> >> > match before, they will afterwards.
> >> >
> >> > dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
> >> >
> >> > Perhaps I should have said "...if the contents are the same before the call"?
> >> >
> >>
> >> Oh, I see.
> >>
> >> Can we have a clean way to figure out whether two file ranges are the
> >> same in a way that allows false negatives?  I.e. return 1 if the
> >> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
> >> implemented that in the past on btrfs by syncing the ranges and then
> >> comparing FIEMAP output, but that's hideous.
> >
> > I'd almost rather have a separate call, maybe unshare_file_range()?
> >
> 
> Doesn't it make more sense to put that functionality in fallocate()?

That works too, I'm just hoping to keep the copy_file_range stuff
simple.

-chris


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Anna Schumaker
On 09/09/2015 04:38 PM, Chris Mason wrote:
> On Wed, Sep 09, 2015 at 04:26:58PM -0400, Trond Myklebust wrote:
>> On Wed, Sep 9, 2015 at 4:09 PM, Chris Mason wrote:
>>> On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
 On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote:
> On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
>> What I meant by this was: if you ask for "regular copy", you may end
>> up with a reflink anyway.  Anyway, how can you reflink a range and
>> have the contents *not* be the same?
>
> reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> match before, they will afterwards.
>
> dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
>
> Perhaps I should have said "...if the contents are the same before the call"?
>

 Oh, I see.

 Can we have a clean way to figure out whether two file ranges are the
 same in a way that allows false negatives?  I.e. return 1 if the
 ranges are reflinks of each other and 0 if not?  Pretty please?  I've
 implemented that in the past on btrfs by syncing the ranges and then
 comparing FIEMAP output, but that's hideous.
>>>
>>> I'd almost rather have a separate call, maybe unshare_file_range()?
>>>
>>
>> Doesn't it make more sense to put that functionality in fallocate()?
> 
> That works too, I'm just hoping to keep the copy_file_range stuff
> simple.

I agree with keeping copy_file_range() simple, especially for the initial 
merge.  Extra stuff can always be added in later :)
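
For reference, a minimal sketch of what using the simple form of the call looks like from userspace, assuming Python 3.8+ on Linux, where os.copy_file_range wraps the syscall. The helper name copy_range and the read/write fallback are assumptions added for illustration; how the kernel performs the copy (reflink, server-side copy, or a plain data copy) is not visible at this level.

```python
import os


def copy_range(src_fd, dst_fd, length, src_off=0, dst_off=0):
    """Copy up to `length` bytes between two open files; return bytes copied.

    Uses os.copy_file_range when it is available and works, and otherwise
    falls back to pread/pwrite (hypothetical fallback, for portability).
    """
    copied = 0
    use_syscall = hasattr(os, "copy_file_range")
    while copied < length:
        if use_syscall:
            try:
                n = os.copy_file_range(src_fd, dst_fd, length - copied,
                                       src_off + copied, dst_off + copied)
            except OSError:
                use_syscall = False  # e.g. unsupported here; retry with read/write
                continue
        else:
            buf = os.pread(src_fd, min(length - copied, 1 << 20), src_off + copied)
            n = os.pwrite(dst_fd, buf, dst_off + copied) if buf else 0
        if n == 0:
            break  # reached end of the source file
        copied += n
    return copied
```

The loop is needed because the call may copy fewer bytes than requested, much like read and write.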

Anna

> 
> -chris
> 



Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Chris Mason
On Wed, Sep 09, 2015 at 01:37:44PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 9, 2015 at 1:09 PM, Chris Mason wrote:
> > On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> >> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote:
> >> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> >> >> What I meant by this was: if you ask for "regular copy", you may end
> >> >> up with a reflink anyway.  Anyway, how can you reflink a range and
> >> >> have the contents *not* be the same?
> >> >
> >> > reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> >> > match before, they will afterwards.
> >> >
> >> > dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
> >> >
> >> > Perhaps I should have said "...if the contents are the same before the call"?
> >> >
> >>
> >> Oh, I see.
> >>
> >> Can we have a clean way to figure out whether two file ranges are the
> >> same in a way that allows false negatives?  I.e. return 1 if the
> >> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
> >> implemented that in the past on btrfs by syncing the ranges and then
> >> comparing FIEMAP output, but that's hideous.
> >
> > I'd almost rather have a separate call, maybe unshare_file_range()?
> >
> > Is that the end goal to the sharing check?
> 
> My use case was archival.  I can reflink data between a working copy
> and some archived copy and then I can very efficiently tell if the
> working copy has been changed by checking if the reflink is still
> linked.
> 
> It would be even better if I could enumerate which parts of one file
> match which parts of another file.

Oh ok, we can do that pretty quickly with the btrfs searching ioctl
(just walk the items, really fast), but that's root only.

For a real interface maybe a btrfs specific ioctl to compare file
ranges.

-chris


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Chris Mason
On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote:
> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> >> What I meant by this was: if you ask for "regular copy", you may end
> >> up with a reflink anyway.  Anyway, how can you reflink a range and
> >> have the contents *not* be the same?
> >
> > reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> > match before, they will afterwards.
> >
> > dedupe remaps fd_dest's range to fd_src's range only if they match, of 
> > course.
> >
> > Perhaps I should have said "...if the contents are the same before the 
> > call"?
> >
> 
> Oh, I see.
> 
> Can we have a clean way to figure out whether two file ranges are the
> same in a way that allows false negatives?  I.e. return 1 if the
> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
> implemented that in the past on btrfs by syncing the ranges and then
> comparing FIEMAP output, but that's hideous.

I'd almost rather have a separate call, maybe unshare_file_range()?

Is that the end goal to the sharing check?

-chris


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Darrick J. Wong
On Wed, Sep 09, 2015 at 02:52:08PM -0400, Anna Schumaker wrote:
> On 09/08/2015 06:39 PM, Darrick J. Wong wrote:
> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> >> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong  
> >> wrote:
> >>> On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
> >>>> On 08/09/15 20:10, Andy Lutomirski wrote:
> >>>>> On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
> >>>>>  wrote:
> >>>>>> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
> >>>>>>> I see copy_file_range() is a reflink() on BTRFS?
> >>>>>>> That's a bit surprising, as it avoids the copy completely.
> >>>>>>> cp(1) for example considered doing a BTRFS clone by default,
> >>>>>>> but didn't due to expectations that users actually wanted
> >>>>>>> the data duplicated on disk for resilience reasons,
> >>>>>>> and for performance reasons so that write latencies were
> >>>>>>> restricted to the copy operation, rather than being
> >>>>>>> introduced at usage time as the dest file is CoW'd.
> >>>>>>>
> >>>>>>> If reflink() is a possibility for copy_file_range()
> >>>>>>> then could it be done optionally with a flag?
> >>>>>>
> >>>>>> The idea is that filesystems get to choose how to handle copies in the
> >>>>>> default case.  BTRFS could do a reflink, but NFS could do a server side
> >>>
> >>> Eww, different default behaviors depending on the filesystem. :)
> >>>
> >>>>>> copy instead.  I can change the default behavior to only do a data copy
> >>>>>> (unless the reflink flag is specified) instead, if that is desirable.
> >>>>>>
> >>>>>> What does everybody think?
> >>>>>
> >>>>> I think the best you could do is to have a hint asking politely for
> >>>>> the data to be deep-copied.  After all, some filesystems reserve the
> >>>>> right to transparently deduplicate.
> >>>>>
> >>>>> Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
> >>>>> advantage to deep copying unless you actually want two copies for
> >>>>> locality reasons.
> >>>>
> >>>> Agreed. The reflink and server side copy are separate things.
> >>>> There's no advantage to not doing a server side copy,
> >>>> but as mentioned there may be advantages to doing deep copies on BTRFS
> >>>> (another reason, not previously mentioned in this thread, would be
> >>>> to avoid ENOSPC errors at some time in the future).
> >>>>
> >>>> So having control over the deep copy seems useful.
> >>>> It's debatable whether ALLOW_REFLINK should be on/off by default
> >>>> for copy_file_range().  I'd be inclined to have such a setting off by
> >>>> default,
> >>>> but cp(1) at least will work with whatever is chosen.
> >>>
> >>> So far it looks like people are interested in at least these "make data 
> >>> appear
> >>> in this other place" filesystem operations:
> >>>
> >>> 1. reflink
> >>> 2. reflink, but only if the contents are the same (dedupe)
> >>
> >> What I meant by this was: if you ask for "regular copy", you may end
> >> up with a reflink anyway.  Anyway, how can you reflink a range and
> >> have the contents *not* be the same?
> > 
> > reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> > match before, they will afterwards.
> > 
> > dedupe remaps fd_dest's range to fd_src's range only if they match, of 
> > course.
> > 
> > Perhaps I should have said "...if the contents are the same before the 
> > call"?
> > 
> >>
> >>> 3. regular copy
> >>> 4. regular copy, but make the hardware do it for us
> >>> 5. regular copy, but require a second copy on the media (no-dedupe)
> >>
> >> If this comes from me, I have no desire to ever use this as a flag.
> > 
> > I meant (5) as a "disable auto-dedupe for this operation" flag, not as
> > a "reallocate all the shared blocks now" op...
> > 
> >> If someone wants to use chattr or some new operation to say "make this
> >> range of this file belong just to me for purpose of optimizing future
> >> writes", then sure, go for it, with the understanding that there are
> >> plenty of filesystems for which that doesn't even make sense.
> > 
> > "Unshare these blocks" sounds more like something fallocate could do.
> > 
> > So far in my XFS reflink playground, it seems that using the defrag tool to
> > un-cow a file makes most sense.  AFAICT the XFS and ext4 defraggers copy a
> > fragmented file's data to a second file and use a 'swap extents' operation,
> > after which the donor file is unlinked.
> > 
> > Hey, if this syscall turns into a more generic "do something involving two
> > (fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
> > extents" as a 7th operation, to refactor the ioctls.  
> > 
> >>
> >>> 6. regular copy, but don't CoW (eatmyothercopies) (joke)
> >>>
> >>> (Please add whatever ops I missed.)
> >>>
> >>> I think I can see a case for letting (4) fall back to (3) since (4) is an
> >>> optimization of (3).
> >>>
> >>> However, I particularly don't like the idea of (1) falling back to (3-5).
> >>> 

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-09 Thread Darrick J. Wong
On Wed, Sep 09, 2015 at 04:41:34PM -0400, Anna Schumaker wrote:
> On 09/09/2015 04:38 PM, Chris Mason wrote:
> > On Wed, Sep 09, 2015 at 04:26:58PM -0400, Trond Myklebust wrote:
> >> On Wed, Sep 9, 2015 at 4:09 PM, Chris Mason  wrote:
> >>> On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> >>>> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong
> >>>>  wrote:
> >>>>> On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> >>>>>> What I meant by this was: if you ask for "regular copy", you may end
> >>>>>> up with a reflink anyway.  Anyway, how can you reflink a range and
> >>>>>> have the contents *not* be the same?
> >>>>>
> >>>>> reflink forcibly remaps fd_dest's range to fd_src's range.  If they
> >>>>> didn't match before, they will afterwards.
> >>>>>
> >>>>> dedupe remaps fd_dest's range to fd_src's range only if they match, of
> >>>>> course.
> >>>>>
> >>>>> Perhaps I should have said "...if the contents are the same before the
> >>>>> call"?
> >>>>>
> >>>>
> >>>> Oh, I see.
> >>>>
> >>>> Can we have a clean way to figure out whether two file ranges are the
> >>>> same in a way that allows false negatives?  I.e. return 1 if the
> >>>> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
> >>>> implemented that in the past on btrfs by syncing the ranges and then
> >>>> comparing FIEMAP output, but that's hideous.
> >>>
> >>> I'd almost rather have a separate call, maybe unshare_file_range()?
> >>>
> >>
> >> Doesn't it make more sense to put that functionality in fallocate()?

[slightly off-topic]

How about FALLOC_FL_UNSHARE_RANGE?  I've been looking for a place to land
an unshare op that isn't chattr +C, and fallocate seems like a better fit
anyway.
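
A rough sketch of how userspace might call such an op through fallocate(2).
Everything here is an assumption for illustration: the flag is only a
proposal in this thread, the value 0x40 is taken from later linux/falloc.h
headers, unshare_range() is a made-up wrapper name, and the ctypes
signature assumes a 64-bit off_t.

```python
import ctypes
import os

# Assumption: value as defined in later linux/falloc.h; not part of this patch set.
FALLOC_FL_UNSHARE_RANGE = 0x40

_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def unshare_range(fd, offset, length):
    """Ask for private blocks backing [offset, offset + length).

    Returns 0 on success, otherwise the errno reported by fallocate(2)
    (typically EOPNOTSUPP where the filesystem has no shared blocks).
    """
    if _libc.fallocate(fd, FALLOC_FL_UNSHARE_RANGE, offset, length) == 0:
        return 0
    return ctypes.get_errno()
```

Whatever the return code, the file contents should be unchanged; only the
block ownership is meant to differ afterwards.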

--D

> > 
> > That works too, I'm just hoping to keep the copy_file_range stuff
> > simple.
> 
> I agree with keeping copy_file_range() simple, especially for the initial
> merge.  Extra stuff can always be added in later :)
> 
> Anna
> 
> > 
> > -chris
> > 
> 


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Pádraig Brady
On 08/09/15 20:10, Andy Lutomirski wrote:
> On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
>  wrote:
>> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
>>> I see copy_file_range() is a reflink() on BTRFS?
>>> That's a bit surprising, as it avoids the copy completely.
>>> cp(1) for example considered doing a BTRFS clone by default,
>>> but didn't due to expectations that users actually wanted
>>> the data duplicated on disk for resilience reasons,
>>> and for performance reasons so that write latencies were
>>> restricted to the copy operation, rather than being
>>> introduced at usage time as the dest file is CoW'd.
>>>
>>> If reflink() is a possibility for copy_file_range()
>>> then could it be done optionally with a flag?
>>
>> The idea is that filesystems get to choose how to handle copies in the 
>> default case.  BTRFS could do a reflink, but NFS could do a server side copy 
>> instead.  I can change the default behavior to only do a data copy (unless 
>> the reflink flag is specified) instead, if that is desirable.
>>
>> What does everybody think?
> 
> I think the best you could do is to have a hint asking politely for
> the data to be deep-copied.  After all, some filesystems reserve the
> right to transparently deduplicate.
> 
> Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
> advantage to deep copying unless you actually want two copies for
> locality reasons.

Agreed. The reflink and server side copy are separate things.
There's no advantage to not doing a server side copy,
but as mentioned there may be advantages to doing deep copies on BTRFS
(another reason, not previously mentioned in this thread, would be
to avoid ENOSPC errors at some time in the future).

So having control over the deep copy seems useful.
It's debatable whether ALLOW_REFLINK should be on/off by default
for copy_file_range().  I'd be inclined to have such a setting off by default,
but cp(1) at least will work with whatever is chosen.
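
To illustrate "will work with whatever is chosen": a caller can try the
syscall and drop to a plain read/write loop if it is missing or refused.
This is a minimal sketch using Python's os.copy_file_range() wrapper, which
takes no flags argument, so none of the flags debated here appear;
copy_range() and _write_all() are hypothetical helper names.

```python
import os


def _write_all(fd, data):
    """Loop until every byte is written; os.write() may be partial."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def copy_range(src_fd, dst_fd, length, chunk=1 << 20):
    """Copy up to `length` bytes between the fds' current offsets.

    Returns (bytes_copied, used_fallback).
    """
    copied = 0
    use_syscall = hasattr(os, "copy_file_range")
    used_fallback = not use_syscall
    while copied < length:
        want = min(chunk, length - copied)
        if use_syscall:
            try:
                n = os.copy_file_range(src_fd, dst_fd, want)
            except OSError:
                # ENOSYS, EXDEV, EINVAL, ...: switch to read/write for the rest.
                use_syscall = False
                used_fallback = True
                continue
        else:
            data = os.read(src_fd, want)
            _write_all(dst_fd, data)
            n = len(data)
        if n == 0:
            break  # source hit EOF early
        copied += n
    return copied, used_fallback
```

Note that the caller learns only whether the fallback ran, not whether the
kernel path reflinked or copied the data.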

thanks,
Pádraig.


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Andy Lutomirski
On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
 wrote:
> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
>> I see copy_file_range() is a reflink() on BTRFS?
>> That's a bit surprising, as it avoids the copy completely.
>> cp(1) for example considered doing a BTRFS clone by default,
>> but didn't due to expectations that users actually wanted
>> the data duplicated on disk for resilience reasons,
>> and for performance reasons so that write latencies were
>> restricted to the copy operation, rather than being
>> introduced at usage time as the dest file is CoW'd.
>>
>> If reflink() is a possibility for copy_file_range()
>> then could it be done optionally with a flag?
>
> The idea is that filesystems get to choose how to handle copies in the 
> default case.  BTRFS could do a reflink, but NFS could do a server side copy 
> instead.  I can change the default behavior to only do a data copy (unless 
> the reflink flag is specified) instead, if that is desirable.
>
> What does everybody think?

I think the best you could do is to have a hint asking politely for
the data to be deep-copied.  After all, some filesystems reserve the
right to transparently deduplicate.

Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
advantage to deep copying unless you actually want two copies for
locality reasons.

--Andy


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Anna Schumaker
On 09/08/2015 04:45 PM, Darrick J. Wong wrote:
> On Tue, Sep 08, 2015 at 11:08:03AM -0400, Anna Schumaker wrote:
>> On 09/05/2015 04:33 AM, Al Viro wrote:
>>> On Fri, Sep 04, 2015 at 04:25:27PM -0600, Andreas Dilger wrote:
>>>
>>>> This is a bit of a surprising result, since in my testing in the
>>>> past, copy_{to/from}_user() is a major consumer of CPU time (50%
>>>> of a CPU core at 1GB/s).  What backing filesystem did you test on?
>>>
>>> While we are at it, was cp(1) using read(2)/write(2) loop or was it using
>>> something else (sendfile(2), for example)?
>>
>> cp uses a read / write loop, and has some heuristics for guessing an optimum 
>> buffer size.
> 
> ..but afaict cp doesn't fsync at the end, which means it's possible that
> the destination file's blocks are still delalloc and nothing's been flushed
> to disk yet.  What happens if you time (cp /tmp/a /tmp/b ; sync) ?

That's already how I was using cp :).  The example program in my man page also 
doesn't fsync at the end, so the extra sync at the end is needed for both.

Anna

> 
> 2048M / 1.667s = ~1200MB/s.
> 
> --D
> 
>>
>> Anna
>>
>>>
>>



Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Darrick J. Wong
On Tue, Sep 08, 2015 at 11:08:03AM -0400, Anna Schumaker wrote:
> On 09/05/2015 04:33 AM, Al Viro wrote:
> > On Fri, Sep 04, 2015 at 04:25:27PM -0600, Andreas Dilger wrote:
> > 
> >> This is a bit of a surprising result, since in my testing in the
> >> past, copy_{to/from}_user() is a major consumer of CPU time (50%
> >> of a CPU core at 1GB/s).  What backing filesystem did you test on?
> > 
> > While we are at it, was cp(1) using read(2)/write(2) loop or was it using
> > something else (sendfile(2), for example)?
> 
> cp uses a read / write loop, and has some heuristics for guessing an optimum 
> buffer size.

..but afaict cp doesn't fsync at the end, which means it's possible that
the destination file's blocks are still delalloc and nothing's been flushed
to disk yet.  What happens if you time (cp /tmp/a /tmp/b ; sync) ?

2048M / 1.667s = ~1200MB/s.
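
Spelled out, that is 2048 / 1.667, or about 1228.6 MB/s. The sketch below
shows the kind of measurement meant here: a read/write loop that includes
the flush to disk in the timed region. timed_copy() and throughput_mb_s()
are hypothetical names, and the 128 KiB buffer is an arbitrary choice, not
cp's actual heuristic.

```python
import os
import time


def timed_copy(src, dst, bufsize=128 * 1024, sync=True):
    """Copy src to dst with a read/write loop; return elapsed seconds.

    With sync=True the fsync() is inside the timed region, so delayed
    allocation cannot hide the cost of actually writing the blocks.
    """
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(bufsize)
            if not buf:
                break
            fout.write(buf)
        if sync:
            fout.flush()
            os.fsync(fout.fileno())
    return time.monotonic() - start


def throughput_mb_s(megabytes, seconds):
    """Plain ratio, e.g. throughput_mb_s(2048, 1.667)."""
    return megabytes / seconds
```

Timing with sync=False would measure mostly page-cache speed, which is the
concern raised above.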

--D

> 
> Anna
> 
> > 
> 


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Darrick J. Wong
On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
> On 08/09/15 20:10, Andy Lutomirski wrote:
> > On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
> >  wrote:
> >> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
> >>> I see copy_file_range() is a reflink() on BTRFS?
> >>> That's a bit surprising, as it avoids the copy completely.
> >>> cp(1) for example considered doing a BTRFS clone by default,
> >>> but didn't due to expectations that users actually wanted
> >>> the data duplicated on disk for resilience reasons,
> >>> and for performance reasons so that write latencies were
> >>> restricted to the copy operation, rather than being
> >>> introduced at usage time as the dest file is CoW'd.
> >>>
> >>> If reflink() is a possibility for copy_file_range()
> >>> then could it be done optionally with a flag?
> >>
> >> The idea is that filesystems get to choose how to handle copies in the
> >> default case.  BTRFS could do a reflink, but NFS could do a server side

Eww, different default behaviors depending on the filesystem. :)

> >> copy instead.  I can change the default behavior to only do a data copy
> >> (unless the reflink flag is specified) instead, if that is desirable.
> >>
> >> What does everybody think?
> > 
> > I think the best you could do is to have a hint asking politely for
> > the data to be deep-copied.  After all, some filesystems reserve the
> > right to transparently deduplicate.
> > 
> > Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
> > advantage to deep copying unless you actually want two copies for
> > locality reasons.
> 
> Agreed. The reflink and server side copy are separate things.
> There's no advantage to not doing a server side copy,
> but as mentioned there may be advantages to doing deep copies on BTRFS
> (another reason, not previously mentioned in this thread, would be
> to avoid ENOSPC errors at some time in the future).
> 
> So having control over the deep copy seems useful.
> It's debatable whether ALLOW_REFLINK should be on/off by default
> for copy_file_range().  I'd be inclined to have such a setting off by default,
> but cp(1) at least will work with whatever is chosen.

So far it looks like people are interested in at least these "make data appear
in this other place" filesystem operations:

1. reflink
2. reflink, but only if the contents are the same (dedupe)
3. regular copy
4. regular copy, but make the hardware do it for us
5. regular copy, but require a second copy on the media (no-dedupe)
6. regular copy, but don't CoW (eatmyothercopies) (joke)

(Please add whatever ops I missed.)

I think I can see a case for letting (4) fall back to (3) since (4) is an
optimization of (3).

However, I particularly don't like the idea of (1) falling back to (3-5).
Either the kernel can satisfy a request or it can't, but let's not just
assume that we should transmogrify one type of request into another.  Userspace
should decide if a reflink failure should turn into one of the copy variants,
depending on whether the user wants to spread allocation costs over rewrites or
pay it all up front.  Also, if we allow reflink to fall back to copy, how do
programs find out what actually took place?  Or do we simply not allow them to
find out?

Also, programs that expect reflink either to finish or fail quickly might be
surprised if it's possible for reflink to take a longer time than usual and
with the side effect that a deep(er) copy was made.

I guess if someone asks for both (1) and (3) we can do the fallback in the
kernel, like how we handle it right now.
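
As a sketch of the "userspace decides" policy: try the clone ioctl, and
only copy if the caller explicitly allowed it, reporting which one
happened. The request number below is BTRFS_IOC_CLONE, _IOW(0x94, 9, int);
calling it FICLONE, and the reflink_or_copy() helper itself, are
assumptions made for this example.

```python
import fcntl
import shutil

FICLONE = 0x40049409  # _IOW(0x94, 9, int), i.e. BTRFS_IOC_CLONE


def reflink_or_copy(src_path, dst_path, allow_copy):
    """Clone src into dst; fall back to a data copy only if allowed.

    Returns "reflink" or "copy" so the caller knows what took place.
    Raises OSError if the clone fails and allow_copy is False (the
    destination file has been created empty by then).
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            if not allow_copy:
                raise
        shutil.copyfileobj(src, dst)
        return "copy"
```

With allow_copy=False this behaves as "reflink or fail"; with
allow_copy=True it corresponds to asking for both (1) and (3).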

--D

> 
> thanks,
> Pádraig.


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Darrick J. Wong
On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong  
> wrote:
> > On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
> >> On 08/09/15 20:10, Andy Lutomirski wrote:
> >> > On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
> >> >  wrote:
> >> >> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
> >> >>> I see copy_file_range() is a reflink() on BTRFS?
> >> >>> That's a bit surprising, as it avoids the copy completely.
> >> >>> cp(1) for example considered doing a BTRFS clone by default,
> >> >>> but didn't due to expectations that users actually wanted
> >> >>> the data duplicated on disk for resilience reasons,
> >> >>> and for performance reasons so that write latencies were
> >> >>> restricted to the copy operation, rather than being
> >> >>> introduced at usage time as the dest file is CoW'd.
> >> >>>
> >> >>> If reflink() is a possibility for copy_file_range()
> >> >>> then could it be done optionally with a flag?
> >> >>
> >> >> The idea is that filesystems get to choose how to handle copies in the
> >> >> default case.  BTRFS could do a reflink, but NFS could do a server side
> >
> > Eww, different default behaviors depending on the filesystem. :)
> >
> >> >> copy instead.  I can change the default behavior to only do a data copy
> >> >> (unless the reflink flag is specified) instead, if that is desirable.
> >> >>
> >> >> What does everybody think?
> >> >
> >> > I think the best you could do is to have a hint asking politely for
> >> > the data to be deep-copied.  After all, some filesystems reserve the
> >> > right to transparently deduplicate.
> >> >
> >> > Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
> >> > advantage to deep copying unless you actually want two copies for
> >> > locality reasons.
> >>
> >> Agreed. The reflink and server side copy are separate things.
> >> There's no advantage to not doing a server side copy,
> >> but as mentioned there may be advantages to doing deep copies on BTRFS
> >> (another reason, not previously mentioned in this thread, would be
> >> to avoid ENOSPC errors at some time in the future).
> >>
> >> So having control over the deep copy seems useful.
> >> It's debatable whether ALLOW_REFLINK should be on/off by default
> >> for copy_file_range().  I'd be inclined to have such a setting off by 
> >> default,
> >> but cp(1) at least will work with whatever is chosen.
> >
> > So far it looks like people are interested in at least these "make data 
> > appear
> > in this other place" filesystem operations:
> >
> > 1. reflink
> > 2. reflink, but only if the contents are the same (dedupe)
> 
> What I meant by this was: if you ask for "regular copy", you may end
> up with a reflink anyway.  Anyway, how can you reflink a range and
> have the contents *not* be the same?

reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
match before, they will afterwards.

dedupe remaps fd_dest's range to fd_src's range only if they match, of course.

Perhaps I should have said "...if the contents are the same before the call"?
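
The distinction can be modelled in a few lines. This is a toy sketch only:
ToyFile, alloc() and the BLOCKS table are made-up names standing in for
files, block allocation and the media, not any kernel interface.

```python
import itertools

_next_id = itertools.count(1)
BLOCKS = {}  # physical block id -> data


def alloc(data):
    """Store data in a fresh 'physical' block and return its id."""
    bid = next(_next_id)
    BLOCKS[bid] = data
    return bid


class ToyFile:
    def __init__(self, *chunks):
        self.extents = [alloc(c) for c in chunks]

    def read(self):
        return b"".join(BLOCKS[b] for b in self.extents)


def reflink(dst, src):
    """Unconditionally point dst at src's blocks; dst's old data is gone."""
    dst.extents = list(src.extents)


def dedupe(dst, src):
    """Share blocks only if the contents already match; else do nothing."""
    if dst.read() != src.read():
        return False
    dst.extents = list(src.extents)
    return True
```

After reflink() the destination reads the same as the source whatever it
held before; dedupe() never changes what the destination reads.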

> 
> > 3. regular copy
> > 4. regular copy, but make the hardware do it for us
> > 5. regular copy, but require a second copy on the media (no-dedupe)
> 
> If this comes from me, I have no desire to ever use this as a flag.

I meant (5) as a "disable auto-dedupe for this operation" flag, not as
a "reallocate all the shared blocks now" op...

> If someone wants to use chattr or some new operation to say "make this
> range of this file belong just to me for purpose of optimizing future
> writes", then sure, go for it, with the understanding that there are
> plenty of filesystems for which that doesn't even make sense.

"Unshare these blocks" sounds more like something fallocate could do.

So far in my XFS reflink playground, it seems that using the defrag tool to
un-cow a file makes most sense.  AFAICT the XFS and ext4 defraggers copy a
fragmented file's data to a second file and use a 'swap extents' operation,
after which the donor file is unlinked.

Hey, if this syscall turns into a more generic "do something involving two
(fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
extents" as a 7th operation, to refactor the ioctls.  

> 
> > 6. regular copy, but don't CoW (eatmyothercopies) (joke)
> >
> > (Please add whatever ops I missed.)
> >
> > I think I can see a case for letting (4) fall back to (3) since (4) is an
> > optimization of (3).
> >
> > However, I particularly don't like the idea of (1) falling back to (3-5).
> > Either the kernel can satisfy a request or it can't, but let's not just
> > assume that we should transmogrify one type of request into another.  
> > Userspace
> > should decide if a reflink failure should turn into one of the copy 
> > variants,
> > depending on whether the user wants to spread allocation costs over 
> > rewrites or
> > pay it all up front. 

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Andy Lutomirski
On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong  wrote:
> On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
>> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong  
>> wrote:
>> > On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
>> >> On 08/09/15 20:10, Andy Lutomirski wrote:
>> >> > On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
>> >> >  wrote:
>> >> >> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
>> >> >>> I see copy_file_range() is a reflink() on BTRFS?
>> >> >>> That's a bit surprising, as it avoids the copy completely.
>> >> >>> cp(1) for example considered doing a BTRFS clone by default,
>> >> >>> but didn't due to expectations that users actually wanted
>> >> >>> the data duplicated on disk for resilience reasons,
>> >> >>> and for performance reasons so that write latencies were
>> >> >>> restricted to the copy operation, rather than being
>> >> >>> introduced at usage time as the dest file is CoW'd.
>> >> >>>
>> >> >>> If reflink() is a possibility for copy_file_range()
>> >> >>> then could it be done optionally with a flag?
>> >> >>
>> >> >> The idea is that filesystems get to choose how to handle copies in the
>> >> >> default case.  BTRFS could do a reflink, but NFS could do a server side
>> >
>> > Eww, different default behaviors depending on the filesystem. :)
>> >
>> >> >> copy instead.  I can change the default behavior to only do a data copy
>> >> >> (unless the reflink flag is specified) instead, if that is desirable.
>> >> >>
>> >> >> What does everybody think?
>> >> >
>> >> > I think the best you could do is to have a hint asking politely for
>> >> > the data to be deep-copied.  After all, some filesystems reserve the
>> >> > right to transparently deduplicate.
>> >> >
>> >> > Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
>> >> > advantage to deep copying unless you actually want two copies for
>> >> > locality reasons.
>> >>
>> >> Agreed. The reflink and server side copy are separate things.
>> >> There's no advantage to not doing a server side copy,
>> >> but as mentioned there may be advantages to doing deep copies on BTRFS
>> >> (another reason, not previously mentioned in this thread, would be
>> >> to avoid ENOSPC errors at some time in the future).
>> >>
>> >> So having control over the deep copy seems useful.
>> >> It's debatable whether ALLOW_REFLINK should be on/off by default
>> >> for copy_file_range().  I'd be inclined to have such a setting off by 
>> >> default,
>> >> but cp(1) at least will work with whatever is chosen.
>> >
>> > So far it looks like people are interested in at least these "make data 
>> > appear
>> > in this other place" filesystem operations:
>> >
>> > 1. reflink
>> > 2. reflink, but only if the contents are the same (dedupe)
>>
>> What I meant by this was: if you ask for "regular copy", you may end
>> up with a reflink anyway.  Anyway, how can you reflink a range and
>> have the contents *not* be the same?
>
> reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> match before, they will afterwards.
>
> dedupe remaps fd_dest's range to fd_src's range only if they match, of course.
>
> Perhaps I should have said "...if the contents are the same before the call"?
>

Oh, I see.

Can we have a clean way to figure out whether two file ranges are the
same in a way that allows false negatives?  I.e. return 1 if the
ranges are reflinks of each other and 0 if not?  Pretty please?  I've
implemented that in the past on btrfs by syncing the ranges and then
comparing FIEMAP output, but that's hideous.

>>
>> > 3. regular copy
>> > 4. regular copy, but make the hardware do it for us
>> > 5. regular copy, but require a second copy on the media (no-dedupe)
>>
>> If this comes from me, I have no desire to ever use this as a flag.
>
> I meant (5) as a "disable auto-dedupe for this operation" flag, not as
> a "reallocate all the shared blocks now" op...

Hmm, interesting.  What effect does it have on systems that do
deferred auto-dedupe?

>>
>> I think we should focus on what the actual legit use cases might be.
>> Certainly we want to support a mode that's "reflink or fail".  We
>> could have these flags:
>>
>> COPY_FILE_RANGE_ALLOW_REFLINK
>> COPY_FILE_RANGE_ALLOW_COPY
>>
>> Setting neither gets -EINVAL.  Setting both works as is.  Setting just
>> ALLOW_REFLINK will fail if a reflink can't be supported.  Setting just
>> ALLOW_COPY will make a best-effort attempt not to reflink but
>> expressly permits reflinking in cases where either (a) plain old
>> write(2) might also result in a reflink or (b) there is no advantage
>> to not reflinking.
>
> I don't agree with having a 'copy' flag that can reflink when we also have a
> 'reflink' flag.  I guess I just don't like having a flag with different
> meanings depending on context.
>
> Users should be able to get the default behavior by passing '0' for flags, so
> provide 

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Pádraig Brady
On 04/09/15 21:16, Anna Schumaker wrote:
> Copy system calls came up during Plumbers a couple of weeks ago, because
> several filesystems (including NFS and XFS) are currently working on copy
> acceleration implementations.  We haven't heard from Zach Brown in a while,
> so I volunteered to push his patches upstream so individual filesystems
> don't need to keep writing their own ioctls.

Just noting that this pertains only to the data, not the metadata.
Providing metadata copying facilities would be _very_ useful, as
most file system specific details relate to the metadata, and having
VFS operations for that would avoid the plethora of details in each
userspace tool, and theoretically support translations between
disparate metadata.

> The first three patches are a simple reposting of Zach's patches from several
> months ago, with one minor error code fix.  The remaining patches add in a
> fallback mechanism when filesystems don't provide a copy function.  This is
> especially useful when doing a server-side copy on NFS (using the new COPY
> operation in NFS v4.2).  This fallback can be disabled by passing the flag
> COPY_REFLINK to the system call.

I see copy_file_range() is a reflink() on BTRFS?
That's a bit surprising, as it avoids the copy completely.
cp(1) for example considered doing a BTRFS clone by default,
but didn't due to expectations that users actually wanted
the data duplicated on disk for resilience reasons,
and for performance reasons so that write latencies were
restricted to the copy operation, rather than being
introduced at usage time as the dest file is CoW'd.

If reflink() is a possibility for copy_file_range()
then could it be done optionally with a flag?

thanks,
Pádraig


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Darrick J. Wong
On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote:
> On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong  
> wrote:
> > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
> >> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong  
> >> wrote:
> >> > On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
> >> >> On 08/09/15 20:10, Andy Lutomirski wrote:
> >> >> > On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
> >> >> >  wrote:
> >> >> >> On 09/08/2015 11:21 AM, Pádraig Brady wrote:
> >> >> >>> I see copy_file_range() is a reflink() on BTRFS?
> >> >> >>> That's a bit surprising, as it avoids the copy completely.
> >> >> >>> cp(1) for example considered doing a BTRFS clone by default,
> >> >> >>> but didn't due to expectations that users actually wanted
> >> >> >>> the data duplicated on disk for resilience reasons,
> >> >> >>> and for performance reasons so that write latencies were
> >> >> >>> restricted to the copy operation, rather than being
> >> >> >>> introduced at usage time as the dest file is CoW'd.
> >> >> >>>
> >> >> >>> If reflink() is a possibility for copy_file_range()
> >> >> >>> then could it be done optionally with a flag?
> >> >> >>
> >> >> >> The idea is that filesystems get to choose how to handle copies in 
> >> >> >> the
> >> >> >> default case.  BTRFS could do a reflink, but NFS could do a server 
> >> >> >> side
> >> >
> >> > Eww, different default behaviors depending on the filesystem. :)
> >> >
> >> >> >> copy instead.  I can change the default behavior to only do a data copy
> >> >> >> (unless the reflink flag is specified) instead, if that is desirable.
> >> >> >>
> >> >> >> What does everybody think?
> >> >> >
> >> >> > I think the best you could do is to have a hint asking politely for
> >> >> > the data to be deep-copied.  After all, some filesystems reserve the
> >> >> > right to transparently deduplicate.
> >> >> >
> >> >> > Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
> >> >> > advantage to deep copying unless you actually want two copies for
> >> >> > locality reasons.
> >> >>
> >> >> Agreed. The reflink and server side copy are separate things.
> >> >> There's no advantage to not doing a server side copy,
> >> >> but as mentioned there may be advantages to doing deep copies on BTRFS
> >> >> (another reason, not previously mentioned in this thread, would be
> >> >> to avoid ENOSPC errors at some time in the future).
> >> >>
> >> >> So having control over the deep copy seems useful.
> >> >> It's debatable whether ALLOW_REFLINK should be on/off by default
> >> >> for copy_file_range().  I'd be inclined to have such a setting off
> >> >> by default, but cp(1) at least will work with whatever is chosen.
> >> >
> >> > So far it looks like people are interested in at least these "make data
> >> > appear in this other place" filesystem operations:
> >> >
> >> > 1. reflink
> >> > 2. reflink, but only if the contents are the same (dedupe)
> >>
> >> What I meant by this was: if you ask for "regular copy", you may end
> >> up with a reflink anyway.  Anyway, how can you reflink a range and
> >> have the contents *not* be the same?
> >
> > reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
> > match before, they will afterwards.
> >
> > dedupe remaps fd_dest's range to fd_src's range only if they match, of
> > course.
> >
> > Perhaps I should have said "...if the contents are the same before the
> > call"?
> >
> 
> Oh, I see.
> 
> Can we have a clean way to figure out whether two file ranges are the
> same in a way that allows false negatives?  I.e. return 1 if the
> ranges are reflinks of each other and 0 if not?  Pretty please?  I've
> implemented that in the past on btrfs by syncing the ranges and then
> comparing FIEMAP output, but that's hideous.

Another mode for this call... :)
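
[Editorial sketch] The three operations discussed above (reflink, dedupe, and
the "are these ranges already shared?" query) can be illustrated with a toy
in-memory model.  Everything here is hypothetical: struct store, struct
toy_file and the toy_* functions are invented for illustration and are not
the kernel interface from the patch series.  The model also ignores locking,
so it says nothing about the userspace race raised elsewhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4
#define MAX_BLOCKS 16

/* Toy backing store: physical blocks addressed by index (hypothetical). */
struct store {
    char data[MAX_BLOCKS][BLOCK_SIZE];
};

/* Toy file: logical block i lives in physical block map[i]. */
struct toy_file {
    size_t map[MAX_BLOCKS];
    size_t nblocks;
};

/* reflink: unconditionally point dest's range at src's physical blocks. */
static void toy_reflink(struct toy_file *dest, const struct toy_file *src,
                        size_t start, size_t count)
{
    assert(start + count <= src->nblocks && start + count <= dest->nblocks);
    for (size_t i = start; i < start + count; i++)
        dest->map[i] = src->map[i];
}

/* dedupe: remap only if the contents already match; otherwise leave dest
 * untouched and report failure. */
static bool toy_dedupe(const struct store *st, struct toy_file *dest,
                       const struct toy_file *src, size_t start, size_t count)
{
    assert(start + count <= src->nblocks && start + count <= dest->nblocks);
    for (size_t i = start; i < start + count; i++) {
        if (memcmp(st->data[dest->map[i]], st->data[src->map[i]],
                   BLOCK_SIZE) != 0)
            return false;
    }
    toy_reflink(dest, src, start, count);
    return true;
}

/* "Are these ranges reflinks of each other?"  Compares mappings only, so
 * identical data stored in different physical blocks yields false: a false
 * negative, which the request explicitly allows. */
static bool toy_ranges_shared(const struct toy_file *a,
                              const struct toy_file *b,
                              size_t start, size_t count)
{
    assert(start + count <= a->nblocks && start + count <= b->nblocks);
    for (size_t i = start; i < start + count; i++) {
        if (a->map[i] != b->map[i])
            return false;
    }
    return true;
}
```

In this model, toy_dedupe() on two files holding equal bytes in different
physical blocks succeeds and makes toy_ranges_shared() return true afterwards,
while toy_reflink() changes the destination's visible contents if they
differed beforehand.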

> >>
> >> > 3. regular copy
> >> > 4. regular copy, but make the hardware do it for us
> >> > 5. regular copy, but require a second copy on the media (no-dedupe)
> >>
> >> If this comes from me, I have no desire to ever use this as a flag.
> >
> > I meant (5) as a "disable auto-dedupe for this operation" flag, not as
> > a "reallocate all the shared blocks now" op...
> 
> Hmm, interesting.  What effect does it have on systems that do
> deferred auto-dedupe?

If it's a userspace deferred auto-dedupe, then hopefully the program
coordinates with the dedupe program.

Otherwise, it's only effective with a dedupe that runs in the write-path.

> >>
> >> I think we should focus on what the actual legit use cases might be.
> >> Certainly we want to support a mode that's "reflink or fail".  We
> >> could have these flags:
> >>
> >> COPY_FILE_RANGE_ALLOW_REFLINK
> >> COPY_FILE_RANGE_ALLOW_COPY
> >>
> >> Setting neither gets -EINVAL.  Setting both works as is.  Setting just
> >> ALLOW_REFLINK will fail if a reflink can't be supported.  
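
[Editorial sketch] The flag semantics proposed in the quoted text could be
expressed roughly as below.  The flag names come from the thread; the numeric
values, the choose_copy_method() helper, the can_reflink parameter, the
-EOPNOTSUPP return for "reflink requested but unavailable", and the rejection
of unknown bits are all assumptions made for illustration.  The quoted message
is cut off before it describes ALLOW_COPY on its own; this sketch assumes that
it forces a data copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical values: the thread names these flags but not their bits. */
#define COPY_FILE_RANGE_ALLOW_REFLINK 0x1u
#define COPY_FILE_RANGE_ALLOW_COPY    0x2u

enum copy_method { COPY_METHOD_REFLINK, COPY_METHOD_DATA_COPY };

/*
 * Decide how to service a request.  can_reflink models whether the
 * filesystem is able to share extents for this range.
 * Returns 0 and sets *method on success, or a negative errno value.
 */
static int choose_copy_method(unsigned int flags, int can_reflink,
                              enum copy_method *method)
{
    const unsigned int known = COPY_FILE_RANGE_ALLOW_REFLINK |
                               COPY_FILE_RANGE_ALLOW_COPY;

    assert(method != NULL);

    /* Neither flag set (or, by assumption, unknown bits set): -EINVAL. */
    if ((flags & known) == 0 || (flags & ~known) != 0)
        return -EINVAL;

    /* Prefer a reflink when it is both permitted and possible. */
    if ((flags & COPY_FILE_RANGE_ALLOW_REFLINK) && can_reflink) {
        *method = COPY_METHOD_REFLINK;
        return 0;
    }

    /* Otherwise fall back to a data copy if the caller allows one. */
    if (flags & COPY_FILE_RANGE_ALLOW_COPY) {
        *method = COPY_METHOD_DATA_COPY;
        return 0;
    }

    /* Only ALLOW_REFLINK was set and a reflink is not possible.
     * The error code is an assumption; the thread does not name one. */
    return -EOPNOTSUPP;
}
```

With both flags set the caller gets whichever method the filesystem can
provide, which matches "Setting both works as is" in the quoted proposal.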

Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-08 Thread Anna Schumaker
On 09/04/2015 06:25 PM, Andreas Dilger wrote:
> On Sep 4, 2015, at 2:16 PM, Anna Schumaker  wrote:
>>
>> Copy system calls came up during Plumbers a couple of weeks ago,
>> because several filesystems (including NFS and XFS) are currently
>> working on copy acceleration implementations.  We haven't heard from
>> Zach Brown in a while, so I volunteered to push his patches upstream
>> so individual filesystems don't need to keep writing their own ioctls.
>>
>> The first three patches are a simple reposting of Zach's patches
>> from several months ago, with one minor error code fix.  The remaining
>> patches add in a fallback mechanism when filesystems don't provide a
>> copy function.  This is especially useful when doing a server-side
>> copy on NFS (using the new COPY operation in NFS v4.2).  This fallback
>> can be disabled by passing the flag COPY_REFLINK to the system call.
>>
>> The last patch is a man page patch documenting this new system call,
>> including an example program.
>>
>> I tested the fallback option by using /dev/urandom to generate files
>> of varying sizes and copying them.  I compared the time to copy
>> against that of `cp` just to see if there is a noticeable difference.
>> I found that runtimes are roughly the same, but in-kernel copy tends
>> to use less of the cpu.  Values in the tables below are averages
>> across multiple trials.
>>
>>
>> /usr/bin/cp |  512 MB | 1024 MB | 1536 MB | 2048 MB
>> ------------|---------|---------|---------|---------
>>        user |   0.00s |   0.00s |   0.00s |   0.00s
>>      system |   0.32s |   0.52s |   1.04s |   1.04s
>>         cpu |     73% |     69% |     62% |     62%
>>       total |   0.446 |   0.757 |   1.197 |   1.667
>>
>>
>>    VFS copy |  512 MB | 1024 MB | 1536 MB | 2048 MB
>> ------------|---------|---------|---------|---------
>>        user |   0.00s |   0.00s |   0.00s |   0.00s
>>      system |   0.33s |   0.49s |   0.76s |   0.99s
>>         cpu |     77% |     62% |     60% |     59%
>>       total |   0.422 |   0.777 |   1.267 |   1.655
>>
>>
>> Questions?  Comments?  Thoughts?
> 
> This is a bit of a surprising result, since in my testing in the
> past, copy_{to/from}_user() is a major consumer of CPU time (50%
> of a CPU core at 1GB/s).  What backing filesystem did you test on?

I tested using XFS against two KVM guests.  Maybe something there is adding the 
extra cpu cycles?

Anna

> 
> In theory, the VFS copy routines should save at least 50% of the
> CPU usage since they only need to make one copy (src->dest) instead
> of two (kernel->user, user->kernel).  Ideally it wouldn't make any
> data copies at all and just pass page references from the source
> to the target.
> 
> Cheers, Andreas
>>
>> Anna
>>
>>
>> Anna Schumaker (5):
>>  btrfs: Add mountpoint checking during btrfs_copy_file_range
>>  vfs: Remove copy_file_range mountpoint checks
>>  vfs: Copy should check len after file open mode
>>  vfs: Copy should use file_out rather than file_in
>>  vfs: Fall back on splice if no copy function defined
>>
>> Zach Brown (3):
>>  vfs: add copy_file_range syscall and vfs helper
>>  x86: add sys_copy_file_range to syscall tables
>>  btrfs: add .copy_file_range file operation
>>
>>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>>  fs/btrfs/ctree.h                       |   3 +
>>  fs/btrfs/file.c                        |   1 +
>>  fs/btrfs/ioctl.c                       |  95 ++--
>>  fs/read_write.c                        | 132 +
>>  include/linux/copy.h                   |   6 ++
>>  include/linux/fs.h                     |   3 +
>>  include/uapi/asm-generic/unistd.h      |   4 +-
>>  include/uapi/linux/Kbuild              |   1 +
>>  include/uapi/linux/copy.h              |   6 ++
>>  kernel/sys_ni.c                        |   1 +
>>  12 files changed, 214 insertions(+), 40 deletions(-)
>> create mode 100644 include/linux/copy.h
>> create mode 100644 include/uapi/linux/copy.h
>>
>> -- 
>> 2.5.1
>>
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
> 



Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-05 Thread Al Viro
On Fri, Sep 04, 2015 at 04:25:27PM -0600, Andreas Dilger wrote:

> This is a bit of a surprising result, since in my testing in the
> past, copy_{to/from}_user() is a major consumer of CPU time (50%
> of a CPU core at 1GB/s).  What backing filesystem did you test on?

While we are at it, was cp(1) using read(2)/write(2) loop or was it using
something else (sendfile(2), for example)?


Re: [PATCH v1 0/8] VFS: In-kernel copy system call

2015-09-04 Thread Andreas Dilger
On Sep 4, 2015, at 2:16 PM, Anna Schumaker  wrote:
> 
> Copy system calls came up during Plumbers a couple of weeks ago,
> because several filesystems (including NFS and XFS) are currently
> working on copy acceleration implementations.  We haven't heard from
> Zach Brown in a while, so I volunteered to push his patches upstream
> so individual filesystems don't need to keep writing their own ioctls.
> 
> The first three patches are a simple reposting of Zach's patches
> from several months ago, with one minor error code fix.  The remaining
> patches add in a fallback mechanism when filesystems don't provide a
> copy function.  This is especially useful when doing a server-side
> copy on NFS (using the new COPY operation in NFS v4.2).  This fallback
> can be disabled by passing the flag COPY_REFLINK to the system call.
> 
> The last patch is a man page patch documenting this new system call,
> including an example program.
> 
> I tested the fallback option by using /dev/urandom to generate files
> of varying sizes and copying them.  I compared the time to copy
> against that of `cp` just to see if there is a noticeable difference.
> I found that runtimes are roughly the same, but in-kernel copy tends
> to use less of the cpu.  Values in the tables below are averages
> across multiple trials.
> 
> 
> /usr/bin/cp |  512 MB | 1024 MB | 1536 MB | 2048 MB
> ------------|---------|---------|---------|---------
>        user |   0.00s |   0.00s |   0.00s |   0.00s
>      system |   0.32s |   0.52s |   1.04s |   1.04s
>         cpu |     73% |     69% |     62% |     62%
>       total |   0.446 |   0.757 |   1.197 |   1.667
>
>
>    VFS copy |  512 MB | 1024 MB | 1536 MB | 2048 MB
> ------------|---------|---------|---------|---------
>        user |   0.00s |   0.00s |   0.00s |   0.00s
>      system |   0.33s |   0.49s |   0.76s |   0.99s
>         cpu |     77% |     62% |     60% |     59%
>       total |   0.422 |   0.777 |   1.267 |   1.655
> 
> 
> Questions?  Comments?  Thoughts?

This is a bit of a surprising result, since in my testing in the
past, copy_{to/from}_user() is a major consumer of CPU time (50%
of a CPU core at 1GB/s).  What backing filesystem did you test on?

In theory, the VFS copy routines should save at least 50% of the
CPU usage since they only need to make one copy (src->dest) instead
of two (kernel->user, user->kernel).  Ideally it wouldn't make any
data copies at all and just pass page references from the source
to the target.

Cheers, Andreas
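
[Editorial sketch] For reference, the "two copies" mentioned above are the two
user/kernel crossings in a conventional read(2)/write(2) loop.  The helper
below is a minimal illustration, not code from cp(1) or from the patch series;
the name copy_via_read_write() and the 64 KiB buffer size are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Copy up to len bytes from fd_in to fd_out with a read(2)/write(2) loop.
 * Each chunk crosses the user/kernel boundary twice, which is the cost an
 * in-kernel copy is meant to avoid.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t copy_via_read_write(int fd_in, int fd_out, size_t len)
{
    char buf[64 * 1024];
    size_t total = 0;

    assert(fd_in >= 0 && fd_out >= 0);

    while (total < len) {
        size_t want = len - total < sizeof(buf) ? len - total : sizeof(buf);

        ssize_t n = read(fd_in, buf, want);      /* copy 1: kernel -> user */
        if (n < 0) {
            if (errno == EINTR)
                continue;                        /* interrupted, retry */
            return -1;
        }
        if (n == 0)
            break;                               /* end of file */

        /* write(2) may be short, so loop until the chunk is flushed. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
                                                 /* copy 2: user -> kernel */
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

An in-kernel copy removes the intermediate user buffer, which is why the
roughly equal CPU figures in the tables above were unexpected.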
> 
> Anna
> 
> 
> Anna Schumaker (5):
>  btrfs: Add mountpoint checking during btrfs_copy_file_range
>  vfs: Remove copy_file_range mountpoint checks
>  vfs: Copy should check len after file open mode
>  vfs: Copy should use file_out rather than file_in
>  vfs: Fall back on splice if no copy function defined
> 
> Zach Brown (3):
>  vfs: add copy_file_range syscall and vfs helper
>  x86: add sys_copy_file_range to syscall tables
>  btrfs: add .copy_file_range file operation
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/btrfs/ctree.h                       |   3 +
>  fs/btrfs/file.c                        |   1 +
>  fs/btrfs/ioctl.c                       |  95 ++--
>  fs/read_write.c                        | 132 +
>  include/linux/copy.h                   |   6 ++
>  include/linux/fs.h                     |   3 +
>  include/uapi/asm-generic/unistd.h      |   4 +-
>  include/uapi/linux/Kbuild              |   1 +
>  include/uapi/linux/copy.h              |   6 ++
>  kernel/sys_ni.c                        |   1 +
>  12 files changed, 214 insertions(+), 40 deletions(-)
> create mode 100644 include/linux/copy.h
> create mode 100644 include/uapi/linux/copy.h
> 
> -- 
> 2.5.1
> 


Cheers, Andreas




