Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-15 Thread Pavel Machek
Hi!

> > Good idea. Are you thinking of a sysfs entry to select the backend?
> 
> Not sure on this one, initially I thought of a sysfs file, but then
> how would you do it. One "global" sysfs entry is probably a bad idea.
> Having one per block device to select native vs emulation maybe? And
> a good way to benchmark.
> 
> The other idea would be a benchmark loop on boot like the raid library
> does.

Doing copy benchmarking would require writes on the media, right?

Kernel should not do such stuff without userspace requesting it...

Best regards,
Pavel
-- 
http://www.livejournal.com/~pavelmachek




Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-09 Thread Javier González

On 08.12.2020 13:24, Johannes Thumshirn wrote:
> On 08/12/2020 14:13, Javier González wrote:
>> On 08.12.2020 12:37, Johannes Thumshirn wrote:
>>> On 08/12/2020 13:22, Javier González wrote:
>>>> Good idea. Are you thinking of a sysfs entry to select the backend?
>>>
>>> Not sure on this one, initially I thought of a sysfs file, but then
>>> how would you do it. One "global" sysfs entry is probably a bad idea.
>>> Having one per block device to select native vs emulation maybe? And
>>> a good way to benchmark.
>>
>> I was thinking of a per-block-device entry to target the use case where
>> a certain implementation / workload is better one way or the other.
>
> Yes, something along those lines.
>
>>> The other idea would be a benchmark loop on boot like the raid library
>>> does.
>>>
>>> Then on the other hand, there might be workloads that run faster with
>>> the emulation and some that run faster with the hardware acceleration.
>>>
>>> I think these points are the reason the last attempts got stuck.
>>
>> Yes. I believe that any benchmark we run would be biased in a certain
>> way. If we can move forward with a sysfs entry and default to the legacy
>> path, we would not alter current behavior and would enable NVMe copy
>> offload (for now) for those that want to use it. We can then build on
>> top of it.
>>
>> Does this sound like a reasonable approach?
>
> Yes, this sounds like a reasonable approach to me.

Cool. We will add this to the V3 then.

Thanks, Johannes!


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-08 Thread Martin K. Petersen


Hannes,

[Sorry I'm late to the discussion here, had a few fires going in
addition to the end of the kernel cycle]

> And we shouldn't forget that the main issue which killed all previous
> implementations was a missing QoS guarantee.

That and the fact that our arbitrary stacking situation was hard to
resolve.

The QoS guarantee was somewhat addressed by Fred in T10. But I agree we
need some sort of toggle.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-08 Thread Johannes Thumshirn
On 08/12/2020 14:13, Javier González wrote:
> On 08.12.2020 12:37, Johannes Thumshirn wrote:
>> On 08/12/2020 13:22, Javier González wrote:
>>> Good idea. Are you thinking of a sysfs entry to select the backend?
>>
>> Not sure on this one, initially I thought of a sysfs file, but then
>> how would you do it. One "global" sysfs entry is probably a bad idea.
>> Having one per block device to select native vs emulation maybe? And
>> a good way to benchmark.
> 
> I was thinking of a per-block-device entry to target the use case where
> a certain implementation / workload is better one way or the other.

Yes, something along those lines.

>>
>> The other idea would be a benchmark loop on boot like the raid library
>> does.
>>
>> Then on the other hand, there might be workloads that run faster with
>> the emulation and some that run faster with the hardware acceleration.
>>
>> I think these points are the reason the last attempts got stuck.
> 
> Yes. I believe that any benchmark we run would be biased in a certain
> way. If we can move forward with a sysfs entry and default to the legacy
> path, we would not alter current behavior and would enable NVMe copy
> offload (for now) for those that want to use it. We can then build on
> top of it.
>
> Does this sound like a reasonable approach?
>

Yes, this sounds like a reasonable approach to me.


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-08 Thread Javier González

On 08.12.2020 12:37, Johannes Thumshirn wrote:
> On 08/12/2020 13:22, Javier González wrote:
>> Good idea. Are you thinking of a sysfs entry to select the backend?
>
> Not sure on this one, initially I thought of a sysfs file, but then
> how would you do it. One "global" sysfs entry is probably a bad idea.
> Having one per block device to select native vs emulation maybe? And
> a good way to benchmark.

I was thinking of a per-block-device entry to target the use case where
a certain implementation / workload is better one way or the other.

> The other idea would be a benchmark loop on boot like the raid library
> does.
>
> Then on the other hand, there might be workloads that run faster with
> the emulation and some that run faster with the hardware acceleration.
>
> I think these points are the reason the last attempts got stuck.

Yes. I believe that any benchmark we run would be biased in a certain
way. If we can move forward with a sysfs entry and default to the legacy
path, we would not alter current behavior and would enable NVMe copy
offload (for now) for those that want to use it. We can then build on
top of it.

Does this sound like a reasonable approach?


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-08 Thread Johannes Thumshirn
On 08/12/2020 13:22, Javier González wrote:
> Good idea. Are you thinking of a sysfs entry to select the backend?

Not sure on this one, initially I thought of a sysfs file, but then
how would you do it. One "global" sysfs entry is probably a bad idea.
Having one per block device to select native vs emulation maybe? And
a good way to benchmark.

The other idea would be a benchmark loop on boot like the raid library
does.

Then on the other hand, there might be workloads that run faster with 
the emulation and some that run faster with the hardware acceleration.

I think these points are the reason the last attempts got stuck.
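A per-device switch of this kind could be a plain queue attribute. The
sketch below only illustrates the idea under discussion and is not code
from the RFC: the attribute name copy_offload is invented here,
blk_queue_copy() is an assumed test_bit()-style helper for the RFC's
QUEUE_FLAG_COPY, and the show/store helpers follow the conventions of
block/blk-sysfs.c of that era.

/* Sketch of a per-device native-vs-emulation switch in block/blk-sysfs.c.
 * Writing 0 selects read/write emulation, 1 the device's native copy.
 */
static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
{
        return sprintf(page, "%d\n", blk_queue_copy(q) ? 1 : 0);
}

static ssize_t queue_copy_offload_store(struct request_queue *q,
                                        const char *page, size_t count)
{
        unsigned long offload;
        ssize_t ret = queue_var_store(&offload, page, count);

        if (ret < 0)
                return ret;

        if (offload)
                blk_queue_flag_set(QUEUE_FLAG_COPY, q);
        else
                blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
        return count;
}

static struct queue_sysfs_entry queue_copy_offload_entry = {
        .attr   = { .name = "copy_offload", .mode = 0644 },
        .show   = queue_copy_offload_show,
        .store  = queue_copy_offload_store,
};

With something like this, "echo 0 > /sys/block/nvme0n1/queue/copy_offload"
would pin a device whose native copy turns out to be slow to the emulated
path, without affecting other devices.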


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-08 Thread Javier González

On 08.12.2020 08:40, Johannes Thumshirn wrote:
> On 07/12/2020 20:27, Javier González wrote:
>> Good point. We can share some performance data on how Simple Copy scales
>> in terms of bw / latency and CPU usage. Do you have anything else in
>> mind?
>
> With an emulation in the kernel, we could make the used "backend"
> implementation configurable. So if the emulation is faster, users can
> select the emulation; if the device is faster, then the device.
>
> Kind of what the crypto and raid code do as well.

Good idea. Are you thinking of a sysfs entry to select the backend?

> I'm really interested in this work, as BTRFS relocation/balance will have
> potential benefits, but we need to get it right.

Agree. We will post a V3 with emulation and address the other comments.
We can take it from there. If you have comments on V2, please send them
our way and we will take them in.


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-08 Thread Javier González

On 08.12.2020 07:44, Hannes Reinecke wrote:
> On 12/7/20 11:12 PM, Douglas Gilbert wrote:
>> On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
>>> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>>>> So, I'm really worried about:
>>>>
>>>>   a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>>>      does accelerating dm-kcopyd.  I agree with Damien that lifting
>>>>      dm-kcopyd to common code would also be really nice.  I'm not 100%
>>>>      sure it should be a requirement, but it sure would be nice to have.
>>>>      I don't think just adding an ioctl is enough of a use case for
>>>>      complex kernel infrastructure.
>>>>   b) We had a bunch of different attempts at SCSI XCOPY support from,
>>>>      IIRC, Martin, Bart and Mikulas.  I think we need to pull them into
>>>>      this discussion, and make sure whatever we do covers the SCSI
>>>>      needs.
>>>
>>> And we shouldn't forget that the main issue which killed all previous
>>> implementations was a missing QoS guarantee.
>>> It's nice to have simple copy, but if the implementation is _slower_
>>> than doing it by hand from the OS there is very little point in even
>>> attempting to do so.
>>> I can't see any provisions for that in the TPAR, leading me to the
>>> assumption that NVMe simple copy will suffer from the same issue.
>>>
>>> So if we can't address this I guess this attempt will fail, too.
>>
>> I have been doing quite a lot of work and testing in my sg driver
>> rewrite in the copy and compare area. [...]
>
> Which is precisely my concern.
> Simple copy might be efficient for one particular implementation, but
> it might be completely off the board for others.
> But both will be claiming to support it, with us having no idea whether
> choosing simple copy will speed up matters or not.
> Without a programmatic way to figure out the speed of the
> implementation we have to detect the performance ourselves, like the
> benchmarking loop RAID5 does.
> I was hoping to avoid that, and just ask the device/controller; but
> that turned out to be in vain.

I believe it makes sense to do extensive characterization to understand
how the host and device implementations behave. However, I do not believe
we will get far if the requirement is that any acceleration has to
outperform the legacy path under all circumstances and implementations.

At this moment in time, this is a

Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-08 Thread Johannes Thumshirn
On 07/12/2020 20:27, Javier González wrote:
> Good point. We can share some performance data on how Simple Copy scales
> in terms of bw / latency and the CPU usage. Do you have anything else in
> mind?
> 

With an emulation in the kernel, we could make the used "backend"
implementation configurable. So if the emulation is faster, users can select
the emulation; if the device is faster, then the device.

Kind of what the crypto and raid code do as well.

I'm really interested in this work, as BTRFS relocation/balance will have 
potential benefits, but we need to get it right.
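The raid analogy refers to the boot-time selection in lib/raid6/algos.c,
where raid6_select_algo() times every candidate parity implementation and
keeps the fastest. A stripped-down user-space illustration of that
selection pattern follows; the two candidate functions are placeholders,
and a real copy benchmark would have to touch the media, which is exactly
the objection Pavel raises elsewhere in the thread.

#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef void (*copy_impl_t)(void *dst, const void *src, size_t len);

static void copy_libc(void *d, const void *s, size_t n)
{
        memcpy(d, s, n);
}

static void copy_bytewise(void *d, const void *s, size_t n)
{
        for (size_t i = 0; i < n; i++)
                ((char *)d)[i] = ((const char *)s)[i];
}

int main(void)
{
        static char src[1 << 20], dst[1 << 20];
        const struct { const char *name; copy_impl_t fn; } impls[] = {
                { "libc", copy_libc },
                { "bytewise", copy_bytewise },
        };
        double best_time = 1e30;
        int best = 0;

        /* Time each candidate once, raid6_select_algo() style. */
        for (int i = 0; i < 2; i++) {
                struct timespec t0, t1;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (int rep = 0; rep < 64; rep++)
                        impls[i].fn(dst, src, sizeof(src));
                clock_gettime(CLOCK_MONOTONIC, &t1);

                double elapsed = (t1.tv_sec - t0.tv_sec) +
                                 (t1.tv_nsec - t0.tv_nsec) / 1e9;
                if (elapsed < best_time) {
                        best_time = elapsed;
                        best = i;
                }
        }
        printf("selected backend: %s\n", impls[best].name);
        return 0;
}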


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Hannes Reinecke

On 12/7/20 11:12 PM, Douglas Gilbert wrote:
> On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
>> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>>> So, I'm really worried about:
>>>
>>>   a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>>      does accelerating dm-kcopyd.  I agree with Damien that lifting
>>>      dm-kcopyd to common code would also be really nice.  I'm not 100%
>>>      sure it should be a requirement, but it sure would be nice to have.
>>>      I don't think just adding an ioctl is enough of a use case for
>>>      complex kernel infrastructure.
>>>   b) We had a bunch of different attempts at SCSI XCOPY support from,
>>>      IIRC, Martin, Bart and Mikulas.  I think we need to pull them into
>>>      this discussion, and make sure whatever we do covers the SCSI needs.
>>
>> And we shouldn't forget that the main issue which killed all previous
>> implementations was a missing QoS guarantee.
>> It's nice to have simple copy, but if the implementation is _slower_
>> than doing it by hand from the OS there is very little point in even
>> attempting to do so.
>> I can't see any provisions for that in the TPAR, leading me to the
>> assumption that NVMe simple copy will suffer from the same issue.
>>
>> So if we can't address this I guess this attempt will fail, too.
>
> I have been doing quite a lot of work and testing in my sg driver rewrite
> in the copy and compare area. [...]

Which is precisely my concern.
Simple copy might be efficient for one particular implementation, but it
might be completely off the board for others.
But both will be claiming to support it, with us having no idea whether
choosing simple copy will speed up matters or not.
Without a programmatic way to figure out the speed of the
implementation we have to detect the performance ourselves, like the
benchmarking loop RAID5 does.
I was hoping to avoid that, and just ask the device/controller; but that
turned out to be in vain.


Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Douglas Gilbert

On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>> So, I'm really worried about:
>>
>>   a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>      does accelerating dm-kcopyd.  I agree with Damien that lifting
>>      dm-kcopyd to common code would also be really nice.  I'm not 100%
>>      sure it should be a requirement, but it sure would be nice to have.
>>      I don't think just adding an ioctl is enough of a use case for
>>      complex kernel infrastructure.
>>   b) We had a bunch of different attempts at SCSI XCOPY support from,
>>      IIRC, Martin, Bart and Mikulas.  I think we need to pull them into
>>      this discussion, and make sure whatever we do covers the SCSI needs.
>
> And we shouldn't forget that the main issue which killed all previous
> implementations was a missing QoS guarantee.
> It's nice to have simple copy, but if the implementation is _slower_ than
> doing it by hand from the OS there is very little point in even
> attempting to do so.
> I can't see any provisions for that in the TPAR, leading me to the
> assumption that NVMe simple copy will suffer from the same issue.
>
> So if we can't address this I guess this attempt will fail, too.

I have been doing quite a lot of work and testing in my sg driver rewrite
in the copy and compare area. The baselines for performance are dd and
io_uring-cp (in liburing). There are lots of ways to improve on them. Here
are some:
   - the user data need never pass through user space (it could be
     mmap-ed out during the READ if there is a good reason). Only the
     metadata (e.g. NVMe or SCSI commands) needs to come from user
     space, and errors, if any, are reported back to user space.
   - break a large copy (or compare) into segments, with each segment
     a "comfortable" size for the OS to handle, say 256 KB
   - there is one constraint: the READ in each segment must complete
     before its paired WRITE can commence
     - extra constraint for some zoned disks: WRITEs must be
       issued in order (assuming they are applied in that order; if
       not, each WRITE must be waited on until it completes)
   - arrange for the READ-WRITE pair in each segment to share the same bio
   - have multiple slots, each holding a segment (i.e. a bio and the
     metadata to process a READ-WRITE pair)
   - re-use each slot's bio for the following READ-WRITE pair
   - issue the READs in each slot asynchronously and do an interleaved
     (io)poll for completion. Then issue the paired WRITE
     asynchronously
   - the above "slot" algorithm runs in one thread, so there can be
     multiple threads running the same algorithm. The segment manager
     needs to be locked (or use atomics) so that each segment (identified
     by its starting LBAs) is issued once and only once when the
     next thread wants a segment to copy

Running multiple threads gives diminishing or even worsening returns.
Runtime metrics on lock contention and storage bus capacity may help in
choosing the number of threads. A simpler approach might be to add more
threads until the combined throughput increase is less than, say, 10%.
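A stripped-down, single-threaded rendering of the segment scheme above,
using plain pread()/pwrite() instead of the sg driver or io_uring; this is
an illustration of the core constraint (the READ completes before its
paired WRITE), not Doug's code, and it ignores short reads and the
multi-slot buffer reuse he describes:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SEG_SZ  (256 * 1024)    /* a "comfortable" segment size */

/* Copy 'len' bytes from src_fd to dst_fd in SEG_SZ segments. */
static int copy_segments(int src_fd, int dst_fd, off_t len)
{
        char *buf = malloc(SEG_SZ);

        if (!buf)
                return -1;

        for (off_t off = 0; off < len; off += SEG_SZ) {
                size_t n = (len - off < SEG_SZ) ? (size_t)(len - off) : SEG_SZ;

                /* the READ in each segment must complete ... */
                if (pread(src_fd, buf, n, off) != (ssize_t)n)
                        goto fail;
                /* ... before its paired WRITE can commence */
                if (pwrite(dst_fd, buf, n, off) != (ssize_t)n)
                        goto fail;
        }
        free(buf);
        return 0;
fail:
        free(buf);
        return -1;
}

int main(int argc, char **argv)
{
        if (argc != 4)
                return 1;
        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_WRONLY);

        if (src < 0 || dst < 0)
                return 1;
        return copy_segments(src, dst, atoll(argv[3])) ? 1 : 0;
}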


The 'compare' that I mention is based on the SCSI VERIFY(BYTCHK=1) command
(or the NVMe Compare command). Using dd logic, a disk-to-disk compare can
be implemented with not much more work than changing the WRITE to a VERIFY
command. This is a different approach from the Linux cmp utility, which
READs in both sides and does a memcmp() type operation. Using ramdisks
(from the scsi_debug driver) the compare operation (max ~10 GB/s) was
actually faster than the copy (max ~7 GB/s). I put this down to WRITE
operations taking a write lock over the store while the VERIFY only
needs a read lock, so many VERIFY operations can co-exist on the same
store. Unfortunately, on the real SAS and NVMe SSDs that I tested, the
performance of the VERIFY and NVMe Compare commands is underwhelming.
For comparison, using scsi_debug ramdisks, dd copy throughput was
< 1 GB/s and io_uring-cp was around 2-3 GB/s. The system was Ryzen
3600 based.

Doug Gilbert
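The VERIFY(BYTCHK=1) compare described above can be driven from plain user
space through the generic SG_IO ioctl; the sketch below shows only the CDB
construction (it is not from the sg rewrite, and sense-data decoding for
MISCOMPARE is omitted):

#include <fcntl.h>
#include <scsi/sg.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Issue VERIFY(16) with BYTCHK=1: the device compares 'buf' against
 * 'nlb' logical blocks starting at 'lba' and fails the command on
 * mismatch.
 */
static int scsi_verify16(int fd, uint64_t lba, uint32_t nlb,
                         void *buf, size_t buf_len)
{
        uint8_t cdb[16] = { 0x8f, 0x02 };       /* VERIFY(16), BYTCHK=1 */
        uint8_t sense[32];
        struct sg_io_hdr io;
        int i;

        for (i = 0; i < 8; i++)                 /* LBA, big-endian, bytes 2..9 */
                cdb[2 + i] = lba >> (8 * (7 - i));
        for (i = 0; i < 4; i++)                 /* verification length, bytes 10..13 */
                cdb[10 + i] = nlb >> (8 * (3 - i));

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.dxfer_direction = SG_DXFER_TO_DEV;   /* data-out: the compare data */
        io.dxferp = buf;
        io.dxfer_len = buf_len;
        io.sbp = sense;
        io.mx_sb_len = sizeof(sense);
        io.timeout = 60000;                     /* milliseconds */

        if (ioctl(fd, SG_IO, &io) < 0)
                return -1;
        /* a data mismatch comes back as CHECK CONDITION with MISCOMPARE */
        return (io.status || io.host_status || io.driver_status) ? 1 : 0;
}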


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Javier González

On 07.12.2020 15:56, Hannes Reinecke wrote:
> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>> So, I'm really worried about:
>>
>>  a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>>     does accelerating dm-kcopyd.  I agree with Damien that lifting
>>     dm-kcopyd to common code would also be really nice.  I'm not 100%
>>     sure it should be a requirement, but it sure would be nice to have.
>>     I don't think just adding an ioctl is enough of a use case for
>>     complex kernel infrastructure.
>>  b) We had a bunch of different attempts at SCSI XCOPY support from,
>>     IIRC, Martin, Bart and Mikulas.  I think we need to pull them into
>>     this discussion, and make sure whatever we do covers the SCSI needs.
>
> And we shouldn't forget that the main issue which killed all previous
> implementations was a missing QoS guarantee.
> It's nice to have simple copy, but if the implementation is _slower_
> than doing it by hand from the OS there is very little point in even
> attempting to do so.
> I can't see any provisions for that in the TPAR, leading me to the
> assumption that NVMe simple copy will suffer from the same issue.
>
> So if we can't address this I guess this attempt will fail, too.

Good point. We can share some performance data on how Simple Copy scales
in terms of bw / latency and CPU usage. Do you have anything else in
mind?


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Javier González

On 07.12.2020 15:11, Christoph Hellwig wrote:
> So, I'm really worried about:
>
> a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>    does accelerating dm-kcopyd.  I agree with Damien that lifting dm-kcopyd
>    to common code would also be really nice.  I'm not 100% sure it should
>    be a requirement, but it sure would be nice to have.
>    I don't think just adding an ioctl is enough of a use case for complex
>    kernel infrastructure.

We are looking at dm-kcopyd. I would have liked to start with a very
specific use case and build from there, but I see Damien's and Keith's
point about having a default path. I believe we can add this to the next
version.

> b) We had a bunch of different attempts at SCSI XCOPY support from, IIRC,
>    Martin, Bart and Mikulas.  I think we need to pull them into this
>    discussion, and make sure whatever we do covers the SCSI needs.

Agree. We discussed the scope at length and agreed that everything outside
of the specifics of Simple Copy requires input from the ones who have
worked on XCOPY support in the past.

> On Fri, Dec 04, 2020 at 03:16:57PM +0530, SelvaKumar S wrote:
>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>> v2020.05.04 ("Ratified")
>> [...]
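For context, the dm-kcopyd interface that would be lifted to common code is
already a small asynchronous copy API. A minimal, hedged usage sketch
follows; a real caller keeps one client around for many copies and checks
write_err properly, and the demo_* names are invented here:

#include <linux/blkdev.h>
#include <linux/completion.h>
#include <linux/dm-io.h>
#include <linux/dm-kcopyd.h>
#include <linux/err.h>

/* dm-kcopyd completion callback: non-zero read_err/write_err on failure */
static void demo_copy_done(int read_err, unsigned long write_err, void *ctx)
{
        complete((struct completion *)ctx);
}

/* Copy 'len' sectors between two devices via dm-kcopyd and wait. */
static int demo_kcopyd_copy(struct block_device *src_bdev, sector_t src,
                            struct block_device *dst_bdev, sector_t dst,
                            sector_t len)
{
        struct dm_io_region from = {
                .bdev = src_bdev, .sector = src, .count = len,
        };
        struct dm_io_region to = {
                .bdev = dst_bdev, .sector = dst, .count = len,
        };
        struct dm_kcopyd_client *kc;
        DECLARE_COMPLETION_ONSTACK(done);

        kc = dm_kcopyd_client_create(NULL);
        if (IS_ERR(kc))
                return PTR_ERR(kc);

        dm_kcopyd_copy(kc, &from, 1, &to, 0, demo_copy_done, &done);
        wait_for_completion(&done);
        dm_kcopyd_client_destroy(kc);
        return 0;
}

Note that the source and destination regions may live on different block
devices, which is the multi-device generality Damien points to later in the
thread.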


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Hannes Reinecke

On 12/7/20 3:11 PM, Christoph Hellwig wrote:
> So, I'm really worried about:
>
>  a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>     does accelerating dm-kcopyd.  I agree with Damien that lifting
>     dm-kcopyd to common code would also be really nice.  I'm not 100%
>     sure it should be a requirement, but it sure would be nice to have.
>     I don't think just adding an ioctl is enough of a use case for
>     complex kernel infrastructure.
>  b) We had a bunch of different attempts at SCSI XCOPY support from,
>     IIRC, Martin, Bart and Mikulas.  I think we need to pull them into
>     this discussion, and make sure whatever we do covers the SCSI needs.

And we shouldn't forget that the main issue which killed all previous
implementations was a missing QoS guarantee.
It's nice to have simple copy, but if the implementation is _slower_
than doing it by hand from the OS there is very little point in even
attempting to do so.
I can't see any provisions for that in the TPAR, leading me to the
assumption that NVMe simple copy will suffer from the same issue.

So if we can't address this I guess this attempt will fail, too.

Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Christoph Hellwig
So, I'm really worried about:

 a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
    does accelerating dm-kcopyd.  I agree with Damien that lifting dm-kcopyd
    to common code would also be really nice.  I'm not 100% sure it should
    be a requirement, but it sure would be nice to have.
    I don't think just adding an ioctl is enough of a use case for complex
    kernel infrastructure.
 b) We had a bunch of different attempts at SCSI XCOPY support from, IIRC,
    Martin, Bart and Mikulas.  I think we need to pull them into this
    discussion, and make sure whatever we do covers the SCSI needs.

On Fri, Dec 04, 2020 at 03:16:57PM +0530, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
>
> The specification can be found at the following link:
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>
> This is an RFC. Looking forward to any feedback or other alternate
> designs for plumbing simple copy into the IO stack.
>
> The simple copy command is a copy offloading operation and is used to copy
> multiple contiguous ranges (source_ranges) of LBAs to a single destination
> LBA within the device, reducing traffic between host and device.
>
> This implementation accepts a destination, the number of sources, and an
> array of source ranges from the application, attaches it as payload to the
> bio, and submits it to the device.
>
> The following limits are added to the queue limits and are exposed in sysfs
> to userspace:
>   - *max_copy_sectors* limits the sum of all source_range lengths
>   - *max_copy_nr_ranges* limits the number of source ranges
>   - *max_copy_range_sectors* limits the maximum number of sectors
>     that can constitute a single source range.
> 
> Changes from v1:
> 
> 1. Fix memory leak in __blkdev_issue_copy
> 2. Unmark blk_check_copy inline
> 3. Fix line break in blk_check_copy_eod
> 4. Remove p checks and made code more readable
> 5. Don't use bio_set_op_attrs and remove op and set
>bi_opf directly
> 6. Use struct_size to calculate total_size
> 7. Fix partition remap of copy destination
> 8. Remove mcl,mssrl,msrc from nvme_ns
> 9. Initialize copy queue limits to 0 in nvme_config_copy
> 10. Remove return in QUEUE_FLAG_COPY check
> 11. Remove unused OCFS
> 
> SelvaKumar S (2):
>   block: add simple copy support
>   nvme: add simple copy support
> 
>  block/blk-core.c  |  94 ++---
>  block/blk-lib.c   | 123 ++
>  block/blk-merge.c |   2 +
>  block/blk-settings.c  |  11 
>  block/blk-sysfs.c |  23 +++
>  block/blk-zoned.c |   1 +
>  block/bounce.c|   1 +
>  block/ioctl.c |  43 +
>  drivers/nvme/host/core.c  |  87 +++
>  include/linux/bio.h   |   1 +
>  include/linux/blk_types.h |  15 +
>  include/linux/blkdev.h|  15 +
>  include/linux/nvme.h  |  43 -
>  include/uapi/linux/fs.h   |  13 
>  14 files changed, 461 insertions(+), 11 deletions(-)
> 
> -- 
> 2.25.1
---end quoted text---


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Damien Le Moal
On 2020/12/07 17:16, javier.g...@samsung.com wrote:
> On 07.12.2020 08:06, Damien Le Moal wrote:
>> On 2020/12/07 16:46, javier.g...@samsung.com wrote:
>>> On 04.12.2020 23:40, Keith Busch wrote:
>>>> On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
>>>>> On 2020/12/04 20:02, SelvaKumar S wrote:
>>>>>> This patchset tries to add support for TP4065a ("Simple Copy
>>>>>> Command"), v2020.05.04 ("Ratified")
>>>>>> [...]
>>>>>
>>>>> Same comment as before. I think this is a good start, but for this
>>>>> to be really useful to users and kernel components alike, this
>>>>> really needs copy emulation for drives that do not have a native
>>>>> copy feature, similar to the write zeroes handling, for instance:
>>>>> if the drive does not have a copy command (simple copy for NVMe or
>>>>> XCOPY for scsi), then the block layer should issue read/write
>>>>> commands to seamlessly execute the copy. Otherwise, this will only
>>>>> serve a small niche of users and will not be optimal for FS and DM
>>>>> drivers that could be simplified with a generic block layer copy
>>>>> functionality.
>>>>>
>>>>> This is my 10 cents though, others may differ about this.
>>>>
>>>> Yes, I agree that copy emulation support should be included with the
>>>> hardware enabled solution.
>>>
>>> Keith, Damien,
>>>
>>> Can we do the block layer emulation with this patchset and then work
>>> in follow-up patchsets on (i) the FS interface with F2FS as a first
>>> user and (ii) other HW accelerations such as XCOPY?
>>
>> The initial patchset supporting NVMe simple copy and emulated copy, all
>> under an API that will probably be similar to that of dm-kcopyd, will
>> cover all block devices. Other hardware-native support for copy
>> functions such as scsi extended copy can be added later under the hood
>> without any API changes (or with minimal changes).
>
> Sounds good. That we can do. We will add a new patch for this.
>
>> I am not sure what you mean by "FS interface for F2FS": the block layer
>> API for this copy functionality will be what F2FS (and other FSes) will
>> call. That is the interface, no?
>
> Essentially yes. I mean adding the F2FS logic and potentially some
> helpers to the block layer to aid GC.

GC is very much special to each FS, so I do not think adding helpers to the
block layer will have value. We should stick to a pure block copy API for
that layer.

>>> For XCOPY, I believe we need to have a separate discussion as much
>>> work is already done that we should align to.
>>
>> I think Martin (added to this thread) and others have looked into it
>> but I do not think that anything made it into the kernel yet.
>
> Exactly. Looking at some of the code posted through time and recalling
> the discussions at LSF/MM, it seems like there are a number of things we
> are not addressing here that could be incorporated down the road, such
> as dedicated syscalls / extensions, multi namespace / device support,
> etc.

dm-kcopyd's interface supports copy between multiple devices. That of
course would not enable NVMe simple copy use, but it makes the interface
generic enough that we should not have any problem with other hardware
copy functions.


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread javier.g...@samsung.com

On 07.12.2020 08:06, Damien Le Moal wrote:
> On 2020/12/07 16:46, javier.g...@samsung.com wrote:
>> On 04.12.2020 23:40, Keith Busch wrote:
>>> On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
>>>> On 2020/12/04 20:02, SelvaKumar S wrote:
>>>>> This patchset tries to add support for TP4065a ("Simple Copy
>>>>> Command"), v2020.05.04 ("Ratified")
>>>>> [...]
>>>>
>>>> Same comment as before. I think this is a good start, but for this
>>>> to be really useful to users and kernel components alike, this
>>>> really needs copy emulation for drives that do not have a native
>>>> copy feature, similar to the write zeroes handling, for instance:
>>>> if the drive does not have a copy command (simple copy for NVMe or
>>>> XCOPY for scsi), then the block layer should issue read/write
>>>> commands to seamlessly execute the copy. Otherwise, this will only
>>>> serve a small niche of users and will not be optimal for FS and DM
>>>> drivers that could be simplified with a generic block layer copy
>>>> functionality.
>>>>
>>>> This is my 10 cents though, others may differ about this.
>>>
>>> Yes, I agree that copy emulation support should be included with the
>>> hardware enabled solution.
>>
>> Keith, Damien,
>>
>> Can we do the block layer emulation with this patchset and then work
>> in follow-up patchsets on (i) the FS interface with F2FS as a first
>> user and (ii) other HW accelerations such as XCOPY?
>
> The initial patchset supporting NVMe simple copy and emulated copy, all
> under an API that will probably be similar to that of dm-kcopyd, will
> cover all block devices. Other hardware-native support for copy
> functions such as scsi extended copy can be added later under the hood
> without any API changes (or with minimal changes).

Sounds good. That we can do. We will add a new patch for this.

> I am not sure what you mean by "FS interface for F2FS": the block layer
> API for this copy functionality will be what F2FS (and other FSes) will
> call. That is the interface, no?

Essentially yes. I mean adding the F2FS logic and potentially some
helpers to the block layer to aid GC.

>> For XCOPY, I believe we need to have a separate discussion as much
>> work is already done that we should align to.
>
> I think Martin (added to this thread) and others have looked into it
> but I do not think that anything made it into the kernel yet.

Exactly. Looking at some of the code posted through time and recalling
the discussions at LSF/MM, it seems like there are a number of things we
are not addressing here that could be incorporated down the road, such
as dedicated syscalls / extensions, multi namespace / device support,
etc.




Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Damien Le Moal
On 2020/12/07 16:46, javier.g...@samsung.com wrote:
> On 04.12.2020 23:40, Keith Busch wrote:
>> On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
>>> On 2020/12/04 20:02, SelvaKumar S wrote:
>>>> This patchset tries to add support for TP4065a ("Simple Copy
>>>> Command"), v2020.05.04 ("Ratified")
>>>> [...]
>>>
>>> Same comment as before. I think this is a good start, but for this to
>>> be really useful to users and kernel components alike, this really
>>> needs copy emulation for drives that do not have a native copy
>>> feature, similar to the write zeroes handling, for instance: if the
>>> drive does not have a copy command (simple copy for NVMe or XCOPY for
>>> scsi), then the block layer should issue read/write commands to
>>> seamlessly execute the copy. Otherwise, this will only serve a small
>>> niche of users and will not be optimal for FS and DM drivers that
>>> could be simplified with a generic block layer copy functionality.
>>>
>>> This is my 10 cents though, others may differ about this.
>>
>> Yes, I agree that copy emulation support should be included with the
>> hardware enabled solution.
>
> Keith, Damien,
>
> Can we do the block layer emulation with this patchset and then work in
> follow-up patchsets on (i) the FS interface with F2FS as a first user
> and (ii) other HW accelerations such as XCOPY?

The initial patchset supporting NVMe simple copy and emulated copy, all
under an API that will probably be similar to that of dm-kcopyd, will
cover all block devices. Other hardware-native support for copy functions
such as scsi extended copy can be added later under the hood without any
API changes (or with minimal changes).

I am not sure what you mean by "FS interface for F2FS": the block layer
API for this copy functionality will be what F2FS (and other FSes) will
call. That is the interface, no?

> For XCOPY, I believe we need to have a separate discussion as much work
> is already done that we should align to.

I think Martin (added to this thread) and others have looked into it but
I do not think that anything made it into the kernel yet.


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-06 Thread javier.g...@samsung.com

On 04.12.2020 23:40, Keith Busch wrote:
> On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
>> On 2020/12/04 20:02, SelvaKumar S wrote:
>>> This patchset tries to add support for TP4065a ("Simple Copy
>>> Command"), v2020.05.04 ("Ratified")
>>> [...]
>>
>> Same comment as before. I think this is a good start, but for this to
>> be really useful to users and kernel components alike, this really
>> needs copy emulation for drives that do not have a native copy
>> feature, similar to the write zeroes handling, for instance: if the
>> drive does not have a copy command (simple copy for NVMe or XCOPY for
>> scsi), then the block layer should issue read/write commands to
>> seamlessly execute the copy. Otherwise, this will only serve a small
>> niche of users and will not be optimal for FS and DM drivers that
>> could be simplified with a generic block layer copy functionality.
>>
>> This is my 10 cents though, others may differ about this.
>
> Yes, I agree that copy emulation support should be included with the
> hardware enabled solution.

Keith, Damien,

Can we do the block layer emulation with this patchset and then work in
follow-up patchsets on (i) the FS interface with F2FS as a first user and
(ii) other HW accelerations such as XCOPY?

For XCOPY, I believe we need to have a separate discussion as much work
is already done that we should align to.



Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-04 Thread Keith Busch
On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
> On 2020/12/04 20:02, SelvaKumar S wrote:
>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>> v2020.05.04 ("Ratified")
>>
>> The specification can be found at the following link:
>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>
>> This is an RFC. Looking forward to any feedback or other alternate
>> designs for plumbing simple copy into the IO stack.
>>
>> The simple copy command is a copy offloading operation and is used to
>> copy multiple contiguous ranges (source_ranges) of LBAs to a single
>> destination LBA within the device, reducing traffic between host and
>> device.
>>
>> This implementation accepts a destination, the number of sources, and
>> an array of source ranges from the application, attaches it as payload
>> to the bio, and submits it to the device.
>>
>> The following limits are added to the queue limits and are exposed in
>> sysfs to userspace:
>>   - *max_copy_sectors* limits the sum of all source_range lengths
>>   - *max_copy_nr_ranges* limits the number of source ranges
>>   - *max_copy_range_sectors* limits the maximum number of sectors
>>     that can constitute a single source range.
>
> Same comment as before. I think this is a good start, but for this to be
> really useful to users and kernel components alike, this really needs
> copy emulation for drives that do not have a native copy feature,
> similar to the write zeroes handling, for instance: if the drive does
> not have a copy command (simple copy for NVMe or XCOPY for scsi), then
> the block layer should issue read/write commands to seamlessly execute
> the copy. Otherwise, this will only serve a small niche of users and
> will not be optimal for FS and DM drivers that could be simplified with
> a generic block layer copy functionality.
>
> This is my 10 cents though, others may differ about this.

Yes, I agree that copy emulation support should be included with the
hardware enabled solution.
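What such an emulation could look like at its crudest, as a sketch against
roughly-5.x-era bio APIs: a single source range is bounced through one page
at a time, with each read fully completing before the paired write. The
function name is invented, and the RFC's eventual emulation need not look
like this; splitting, larger bounce buffers, and proper error handling are
omitted.

/* Sketch: emulate a copy by bouncing one page at a time, the READ
 * fully completing before the paired WRITE is issued.
 */
static int blkdev_copy_emulated(struct block_device *bdev, sector_t src,
                                sector_t dst, unsigned int nr_sects)
{
        struct page *page = alloc_page(GFP_KERNEL);
        unsigned int sects_per_page = PAGE_SIZE >> 9;
        int ret = 0;

        if (!page)
                return -ENOMEM;

        while (nr_sects && !ret) {
                unsigned int n = min(nr_sects, sects_per_page);
                struct bio *bio = bio_alloc(GFP_KERNEL, 1);

                bio_set_dev(bio, bdev);
                bio->bi_iter.bi_sector = src;
                bio->bi_opf = REQ_OP_READ;
                bio_add_page(bio, page, n << 9, 0);
                ret = submit_bio_wait(bio);     /* the read must finish first */
                bio_put(bio);
                if (ret)
                        break;

                bio = bio_alloc(GFP_KERNEL, 1);
                bio_set_dev(bio, bdev);
                bio->bi_iter.bi_sector = dst;
                bio->bi_opf = REQ_OP_WRITE;
                bio_add_page(bio, page, n << 9, 0);
                ret = submit_bio_wait(bio);     /* ...then the paired write */
                bio_put(bio);

                src += n;
                dst += n;
                nr_sects -= n;
        }

        __free_page(page);
        return ret;
}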


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-04 Thread Damien Le Moal
On 2020/12/04 20:02, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
>
> The specification can be found at the following link:
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>
> This is an RFC. Looking forward to any feedback or other alternate
> designs for plumbing simple copy into the IO stack.
>
> The simple copy command is a copy offloading operation and is used to copy
> multiple contiguous ranges (source_ranges) of LBAs to a single destination
> LBA within the device, reducing traffic between host and device.
>
> This implementation accepts a destination, the number of sources, and an
> array of source ranges from the application, attaches it as payload to the
> bio, and submits it to the device.
>
> The following limits are added to the queue limits and are exposed in sysfs
> to userspace:
>   - *max_copy_sectors* limits the sum of all source_range lengths
>   - *max_copy_nr_ranges* limits the number of source ranges
>   - *max_copy_range_sectors* limits the maximum number of sectors
>     that can constitute a single source range.

Same comment as before. I think this is a good start, but for this to be
really useful to users and kernel components alike, this really needs copy
emulation for drives that do not have a native copy feature, similar to the
write zeroes handling, for instance: if the drive does not have a copy
command (simple copy for NVMe or XCOPY for scsi), then the block layer
should issue read/write commands to seamlessly execute the copy. Otherwise,
this will only serve a small niche of users and will not be optimal for FS
and DM drivers that could be simplified with a generic block layer copy
functionality.

This is my 10 cents though, others may differ about this.

> 
> Changes from v1:
> 
> 1. Fix memory leak in __blkdev_issue_copy
> 2. Unmark blk_check_copy inline
> 3. Fix line break in blk_check_copy_eod
> 4. Remove p checks and made code more readable
> 5. Don't use bio_set_op_attrs and remove op and set
>bi_opf directly
> 6. Use struct_size to calculate total_size
> 7. Fix partition remap of copy destination
> 8. Remove mcl,mssrl,msrc from nvme_ns
> 9. Initialize copy queue limits to 0 in nvme_config_copy
> 10. Remove return in QUEUE_FLAG_COPY check
> 11. Remove unused OCFS
> 
> SelvaKumar S (2):
>   block: add simple copy support
>   nvme: add simple copy support
> 
>  block/blk-core.c  |  94 ++---
>  block/blk-lib.c   | 123 ++
>  block/blk-merge.c |   2 +
>  block/blk-settings.c  |  11 
>  block/blk-sysfs.c |  23 +++
>  block/blk-zoned.c |   1 +
>  block/bounce.c|   1 +
>  block/ioctl.c |  43 +
>  drivers/nvme/host/core.c  |  87 +++
>  include/linux/bio.h   |   1 +
>  include/linux/blk_types.h |  15 +
>  include/linux/blkdev.h|  15 +
>  include/linux/nvme.h  |  43 -
>  include/uapi/linux/fs.h   |  13 
>  14 files changed, 461 insertions(+), 11 deletions(-)
> 


-- 
Damien Le Moal
Western Digital Research


[RFC PATCH v2 0/2] add simple copy support

2020-12-04 Thread SelvaKumar S
This patchset tries to add support for TP4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified").

The specification can be found at the following link:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

This is an RFC. Looking forward to any feedback or other alternate
designs for plumbing simple copy into the IO stack.

The simple copy command is a copy offloading operation and is used to copy
multiple contiguous ranges (source_ranges) of LBAs to a single destination
LBA within the device, reducing traffic between host and device.
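For reference, the wire format behind this: the NVMe Copy command (opcode
0x19) carries a list of 32-byte source range descriptors and places the
single destination LBA (SDLBA) in the command itself. The struct below is
an illustrative rendering of the TP 4065 format-0 descriptor, not the one
from this patchset:

#include <linux/types.h>

/* Sketch of a TP 4065 format-0 source range descriptor (32 bytes).
 * Each entry names one contiguous source extent; the device copies
 * all entries, in order, to the destination given by SDLBA.
 */
struct nvme_copy_range_desc {
        __le64  rsvd0;
        __le64  slba;           /* starting LBA of this source range */
        __le16  nlb;            /* number of logical blocks, 0-based */
        __le16  rsvd18;
        __le32  rsvd20;
        __le32  eilbrt;         /* expected initial logical block ref tag */
        __le16  elbat;          /* expected logical block app tag */
        __le16  elbatm;         /* expected logical block app tag mask */
};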

This implementation accepts a destination, the number of sources, and an
array of source ranges from the application, attaches it as payload to the
bio, and submits it to the device.

The following limits are added to the queue limits and are exposed in sysfs
to userspace (see the sketch after this list):
- *max_copy_sectors* limits the sum of all source_range lengths
- *max_copy_nr_ranges* limits the number of source ranges
- *max_copy_range_sectors* limits the maximum number of sectors
  that can constitute a single source range.
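A user-space caller would be expected to read these limits and size its
copy requests accordingly. A hedged sketch (the sysfs file names are the
ones proposed above; whether a given kernel exposes them depends on this
patchset being applied):

#include <stdio.h>

/* Read one of the proposed copy limits for a block device, e.g.
 * read_copy_limit("nvme0n1", "max_copy_sectors"). Returns 0 if the
 * attribute is absent or the device advertises no copy support.
 */
static unsigned long read_copy_limit(const char *disk, const char *attr)
{
        char path[256];
        unsigned long val = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", disk, attr);
        f = fopen(path, "r");
        if (!f)
                return 0;
        if (fscanf(f, "%lu", &val) != 1)
                val = 0;
        fclose(f);
        return val;
}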

Changes from v1:

1. Fix memory leak in __blkdev_issue_copy
2. Unmark blk_check_copy inline
3. Fix line break in blk_check_copy_eod
4. Remove p checks and made code more readable
5. Don't use bio_set_op_attrs and remove op and set
   bi_opf directly
6. Use struct_size to calculate total_size
7. Fix partition remap of copy destination
8. Remove mcl,mssrl,msrc from nvme_ns
9. Initialize copy queue limits to 0 in nvme_config_copy
10. Remove return in QUEUE_FLAG_COPY check
11. Remove unused OCFS

SelvaKumar S (2):
  block: add simple copy support
  nvme: add simple copy support

 block/blk-core.c  |  94 ++---
 block/blk-lib.c   | 123 ++
 block/blk-merge.c |   2 +
 block/blk-settings.c  |  11 
 block/blk-sysfs.c |  23 +++
 block/blk-zoned.c |   1 +
 block/bounce.c|   1 +
 block/ioctl.c |  43 +
 drivers/nvme/host/core.c  |  87 +++
 include/linux/bio.h   |   1 +
 include/linux/blk_types.h |  15 +
 include/linux/blkdev.h|  15 +
 include/linux/nvme.h  |  43 -
 include/uapi/linux/fs.h   |  13 
 14 files changed, 461 insertions(+), 11 deletions(-)

-- 
2.25.1