Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread javier.g...@samsung.com

On 07.12.2020 08:06, Damien Le Moal wrote:

On 2020/12/07 16:46, javier.g...@samsung.com wrote:

On 04.12.2020 23:40, Keith Busch wrote:

On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:

On 2020/12/04 20:02, SelvaKumar S wrote:

This patchset tries to add support for TP4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified")

The Specification can be found in following link.
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

This is an RFC. Looking forward to any feedback or alternate designs for
plumbing simple copy into the IO stack.

The simple copy command is a copy offloading operation used to copy multiple
contiguous ranges (source_ranges) of LBAs to a single destination LBA within
the device, reducing traffic between host and device.

This implementation accepts the destination, the number of source ranges, and
an array of source ranges from the application, attaches them as a payload to
the bio, and submits it to the device.

The following limits are added to the queue limits and are exposed to
userspace in sysfs:
- *max_copy_sectors* limits the sum of all source range lengths
- *max_copy_nr_ranges* limits the number of source ranges
- *max_copy_range_sectors* limits the maximum number of sectors
that can constitute a single source range.


Same comment as before. I think this is a good start, but for this to be really
useful to users and kernel components alike, it really needs copy emulation
for drives that do not have a native copy feature, similar to how write zeroes
is handled, for instance: if the drive does not have a copy command (simple
copy for NVMe or XCOPY for SCSI), the block layer should issue read/write
commands to seamlessly execute the copy. Otherwise, this will only serve a
small niche of users and will not be optimal for FS and DM drivers that could
be simplified with a generic block layer copy functionality.

This is my two cents, though; others may differ on this.


Yes, I agree that copy emulation support should be included with the
hardware enabled solution.


Keith, Damien,

Can we do the block layer emulation with this patchset and then work in
follow-up patchsets on (i) the FS interface, with F2FS as a first user, and
(ii) other HW accelerations such as XCOPY?


The initial patchset, supporting NVMe simple copy and copy emulation, all under
an API that will probably be similar to that of dm-kcopyd, will cover all block
devices. Other native hardware support for copy functions, such as SCSI
extended copy, can be added later under the hood without any API changes (or
with minimal changes).


Sounds good. That we can do. We will add a new patch for this.



I am not sure what you mean by "FS interface for F2FS": the block layer API for
this copy functionality will be what F2FS (and other FSes) will call. That is
the interface, no ?


Essentially, yes. I mean adding the F2FS logic and potentially some
helpers to the block layer to aid GC.




For XCOPY, I believe we need to have a separate discussion, as much work
has already been done that we should align with.


I think Martin (added to this thread) and others have looked into it but I do
not think that anything made it into the kernel yet.


Exactly. Looking at some of the code posted over time and recalling
the discussions at LSF/MM, it seems there are a number of things we
are not addressing here that could be incorporated down the road, such
as dedicated syscalls / extensions, multi-namespace / multi-device support,
etc.







Re: [PATCH v2 0/2] zone-append support in io-uring and aio

2020-06-26 Thread javier.g...@samsung.com

On 26.06.2020 06:56, Damien Le Moal wrote:

On 2020/06/26 15:37, javier.g...@samsung.com wrote:

On 26.06.2020 03:11, Damien Le Moal wrote:

On 2020/06/26 2:18, Kanchan Joshi wrote:

[Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]

This patchset enables zone-append using io-uring/linux-aio on the block IO
path. The purpose is to provide zone-append capability to applications that
use a zoned block device directly.

The application may specify the RWF_ZONE_APPEND flag on a write when it wants
to send a zone append. RWF_* flags work with a certain subset of APIs, e.g.
io_uring, aio, and pwritev2. An error is reported if zone append is requested
using pwritev2. It is not in the scope of this patchset to support pwritev2 or
any other synchronous write API, for reasons described later.

Zone-append completion result --->
With zone append, where the write took place can only be known after
completion. So apart from the usual write return value, an additional means is
needed to obtain the actual written location.

In aio, this is returned to application using res2 field of io_event -

struct io_event {
__u64   data;   /* the data field from the iocb */
__u64   obj;/* what iocb this event came from */
__s64   res;/* result code for this event */
__s64   res2;   /* secondary result */
};

In io-uring, cqe->flags is repurposed for zone-append result.

struct io_uring_cqe {
__u64   user_data;  /* sqe->data submission passed back */
__s32   res;/* result code for this event */
__u32   flags;
};

Since the 32-bit flags field is not sufficient, we chose to return the
zone-relative offset in 512B sector units. This can cover any zone size
representable by chunk_sectors. Applications will have the trouble of
combining this with the zone start to get the disk-relative offset. But if
more bits are obtained by pulling from the res field, that too would compel
applications to interpret the res field differently, and it seems more
painstaking than the former option. To keep uniformity, the zone-relative
offset is returned even with aio.


I am really not a fan of this, to say the least. The input is byte offset, the
output is 512B relative sector count... Arg... We really cannot do better than
that ?

At the very least, a byte relative offset? The main reason is that this is
_somewhat_ acceptable for raw block device accesses, since the "sector"
abstraction has a clear meaning, but once we add iomap/zonefs async zone
append support, we will really want a byte unit, as the interface is regular
files, not a block device file. We could argue that the 512B sector unit is
still around even for files (e.g. block counts in file stat). But a different
unit for the input and output of one operation is really ugly. This is not
nice for the user.



You can refer to the discussion with Jens, Pavel, and Alex on the io_uring
interface. With the bits we have, and considering the maximum zone size
supported, there is no space for a byte relative offset. We can take
some bits from cqe->res, but we were afraid this is not very
future-proof. Do you have a better idea?


If you can take 8 bits, that gives you 40 bits, enough to support byte
relative offsets for any zone size defined as a number of 512B sectors using
an unsigned int. The max zone size is 2^31 sectors in that case, so 2^40
bytes. Unless I am already too tired and my math is failing me...


Yes, the math is correct. I was thinking more of the bits being needed
for other use cases that could collide with append. We considered this
and discarded it for being messy; when Pavel brought up the 512B
alignment, we saw it as a good alternative.

Note too that we would be able to translate to a byte offset in
iouring.h too, so the user would not need to think about this.

I do not feel strongly about this, so whichever option better fits the
current and near-future needs of io_uring is the one we will send in V3. We
will give it until next week for others to comment too.



The zone size is defined by chunk_sectors, which is used for RAID and software
RAIDs too. This has been an unsigned int forever. I do not see the need to
change this to 64 bits anytime soon, if ever. A RAID with a stripe size larger
than 1TB does not really make any sense. Same for the zone size...


Yes. Max zone sizes are already pretty huge. But this might
change, and we will deal with it when it happens.

[...]

Javier


Re: [PATCH v2 0/2] zone-append support in io-uring and aio

2020-06-26 Thread javier.g...@samsung.com

On 26.06.2020 03:11, Damien Le Moal wrote:

On 2020/06/26 2:18, Kanchan Joshi wrote:

[Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]

This patchset enables zone-append using io-uring/linux-aio on the block IO
path. The purpose is to provide zone-append capability to applications that
use a zoned block device directly.

The application may specify the RWF_ZONE_APPEND flag on a write when it wants
to send a zone append. RWF_* flags work with a certain subset of APIs, e.g.
io_uring, aio, and pwritev2. An error is reported if zone append is requested
using pwritev2. It is not in the scope of this patchset to support pwritev2 or
any other synchronous write API, for reasons described later.

Zone-append completion result --->
With zone append, where the write took place can only be known after
completion. So apart from the usual write return value, an additional means is
needed to obtain the actual written location.

In aio, this is returned to application using res2 field of io_event -

struct io_event {
__u64   data;   /* the data field from the iocb */
__u64   obj;/* what iocb this event came from */
__s64   res;/* result code for this event */
__s64   res2;   /* secondary result */
};

In io-uring, cqe->flags is repurposed for zone-append result.

struct io_uring_cqe {
__u64   user_data;  /* sqe->data submission passed back */
__s32   res;/* result code for this event */
__u32   flags;
};

Since the 32-bit flags field is not sufficient, we chose to return the
zone-relative offset in 512B sector units. This can cover any zone size
representable by chunk_sectors. Applications will have the trouble of
combining this with the zone start to get the disk-relative offset. But if
more bits are obtained by pulling from the res field, that too would compel
applications to interpret the res field differently, and it seems more
painstaking than the former option. To keep uniformity, the zone-relative
offset is returned even with aio.


I am really not a fan of this, to say the least. The input is byte offset, the
output is 512B relative sector count... Arg... We really cannot do better than
that ?

At the very least, a byte relative offset? The main reason is that this is
_somewhat_ acceptable for raw block device accesses, since the "sector"
abstraction has a clear meaning, but once we add iomap/zonefs async zone
append support, we will really want a byte unit, as the interface is regular
files, not a block device file. We could argue that the 512B sector unit is
still around even for files (e.g. block counts in file stat). But a different
unit for the input and output of one operation is really ugly. This is not
nice for the user.



You can refer to the discussion with Jens, Pavel, and Alex on the io_uring
interface. With the bits we have, and considering the maximum zone size
supported, there is no space for a byte relative offset. We can take
some bits from cqe->res, but we were afraid this is not very
future-proof. Do you have a better idea?




Append using io_uring fixed-buffer --->
This is flagged as not supported at the moment. The reason is that for fixed
buffers, io_uring sends an iov_iter of bvec type, but the current append
infrastructure in the block layer does not support such an iov_iter.

Block IO vs File IO --->
For now, the user zone-append interface is supported only for zoned block
devices. Regular files/block devices are not supported. A regular file system
(e.g. F2FS) will not need this anyway, because zone peculiarities are
abstracted within the FS. At this point, ZoneFS also prefers to use append
implicitly rather than explicitly. But if/when ZoneFS starts supporting
explicit/on-demand zone append, the allowing-only-block-device check should
be changed.


Sure, but I think the interface is still a problem. I am not super happy about
the 512B sector unit. Zonefs will be the only file system that is impacted,
since other normal POSIX file systems will not have a zone append interface
for users. So this is a limited problem. Still, even for raw block device file
accesses, POSIX system calls use byte units everywhere. Let's try to use that.

For aio, it is easy since res2 is an unsigned long long. For io_uring, as
discussed already, we can still take 8 bits from the cqe res field. All you
need is to add a small helper function in userspace iouring.h to simplify the
application's work to get that result.


Ok. See above. We can do this.

Jens: Do you see this as a problem in the future?

[...]

Javier


Re: [PATCH 3/3] io_uring: add support for zone-append

2020-06-21 Thread javier.g...@samsung.com

On 19.06.2020 09:44, Jens Axboe wrote:

On 6/19/20 9:40 AM, Matias Bjørling wrote:

On 19/06/2020 17.20, Jens Axboe wrote:

On 6/19/20 9:14 AM, Matias Bjørling wrote:

On 19/06/2020 16.18, Jens Axboe wrote:

On 6/19/20 5:15 AM, Matias Bjørling wrote:

On 19/06/2020 11.41, javier.g...@samsung.com wrote:

Jens,

Would you have time to answer a question below in this thread?

On 18.06.2020 11:11, javier.g...@samsung.com wrote:

On 18.06.2020 08:47, Damien Le Moal wrote:

On 2020/06/18 17:35, javier.g...@samsung.com wrote:

On 18.06.2020 07:39, Damien Le Moal wrote:

On 2020/06/18 2:27, Kanchan Joshi wrote:

From: Selvakumar S 

Introduce three new opcodes for zone-append -

IORING_OP_ZONE_APPEND : non-vectored, similar to IORING_OP_WRITE
IORING_OP_ZONE_APPENDV: vectored, similar to IORING_OP_WRITEV
IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers

Repurpose cqe->flags to return zone-relative offset.

Signed-off-by: SelvaKumar S 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier Gonzalez 
---
fs/io_uring.c | 72 +--
include/uapi/linux/io_uring.h |  8 -
2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
	unsigned long		fsize;
	u64			user_data;
	u32			result;
+#ifdef CONFIG_BLK_DEV_ZONED
+	/* zone-relative offset for append, in bytes */
+	u32			append_offset;

this can overflow. u64 is needed.

We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.

We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.

Do you believe this is a better approach?

The problem is that zone sizes are 32 bits in the kernel, as a number of
sectors. So any device that has a zone size smaller than or equal to 2^31
512B sectors can be accepted. Using a zone relative offset in bytes for
returning the zone append result is OK-ish, but to match the kernel
supported range of possible zone sizes, you need 31+9 bits... 32 does not
cut it.

Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.

Converting to u64 will require a new version of io_uring_cqe, where we
extend by at least 32 bits. I believe this will need a whole new allocation
and probably an ioctl().

Is this an acceptable change for you? We will of course add support in
liburing when we agree on the right way to do this.

I took a quick look at the code. No expert, but why not use the existing
user_data variable? Use the lowest 40 bits for the zone starting LBA, and
the highest 24 bits as an index into the completion data structure.

If you want to pass the memory address (same as what fio does) for the
data structure used for completion, one may also play some tricks by
using a memory address relative to the data structure. For example, the
x86_64 architecture uses 48 address bits for its memory addresses. With
24 bits, one can allocate the completion entries in a 32MB memory range,
and then use base_address + index to get back to the completion data
structure specified in the sqe.

For any current request, sqe->user_data is just provided back as
cqe->user_data. This would make these requests behave differently
from everything else in that sense, which seems very confusing to me
if I were an application writer.

But generally I do agree with you, there are lots of ways to make
< 64-bit work as a tag without losing anything or having to jump
through hoops to do so. The lack of consistency introduced by having
zone append work differently is ugly, though.


Yep, agree, and extending to three cachelines is a big no-go. We could add
a flag that says the kernel has changed the user_data variable. That would
make it very explicit.

Don't like that either, as it doesn't really change the fact that you're
now doing something very different with the user_data field, which is
just supposed to be passed in/out directly. Adding a random flag to
signal this behavior isn't very explicit either, imho. It's still some
out-of-band (ish) notification of behavior that is different from any
other command. This is very different from having a flag that says
"there's extra information in this other field", which is much cleaner.


Ok. Then it's pulling in the bits from cqe->res and cqe->flags that you
mention in the other mail. Sounds good.


I think that's the best approach, if we need > 32-bits. Maybe we can get
by just using ->res, if we switch to multiples of 512b instead for the
result like Pavel suggested. That'd provide enough room in ->res, and
would be preferable imho. But if we do need > 32-bits, then we can use
this approach.


Soun

Re: [PATCH 3/3] io_uring: add support for zone-append

2020-06-21 Thread javier.g...@samsung.com

On 19.06.2020 09:02, Jens Axboe wrote:

On 6/19/20 8:59 AM, Pavel Begunkov wrote:

On 19/06/2020 17:15, Jens Axboe wrote:

On 6/19/20 3:41 AM, javier.g...@samsung.com wrote:

Jens,

Would you have time to answer a question below in this thread?

On 18.06.2020 11:11, javier.g...@samsung.com wrote:

On 18.06.2020 08:47, Damien Le Moal wrote:

On 2020/06/18 17:35, javier.g...@samsung.com wrote:

On 18.06.2020 07:39, Damien Le Moal wrote:

On 2020/06/18 2:27, Kanchan Joshi wrote:

From: Selvakumar S 

Introduce three new opcodes for zone-append -

  IORING_OP_ZONE_APPEND : non-vectored, similar to IORING_OP_WRITE
  IORING_OP_ZONE_APPENDV: vectored, similar to IORING_OP_WRITEV
  IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers

Repurpose cqe->flags to return zone-relative offset.

Signed-off-by: SelvaKumar S 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier Gonzalez 
---
fs/io_uring.c | 72 +--
include/uapi/linux/io_uring.h |  8 -
2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
unsigned long   fsize;
u64 user_data;
u32 result;
+#ifdef CONFIG_BLK_DEV_ZONED
+   /* zone-relative offset for append, in bytes */
+   u32 append_offset;


this can overflow. u64 is needed.


We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.

We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.

Do you believe this is a better approach?


The problem is that zone sizes are 32 bits in the kernel, as a number
of sectors. So any device that has a zone size smaller than or equal to
2^31 512B sectors can be accepted. Using a zone relative offset in
bytes for returning the zone append result is OK-ish, but to match the
kernel supported range of possible zone sizes, you need 31+9 bits...
32 does not cut it.


Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.


Converting to u64 will require a new version of io_uring_cqe, where we
extend by at least 32 bits. I believe this will need a whole new allocation
and probably an ioctl().

Is this an acceptable change for you? We will of course add support in
liburing when we agree on the right way to do this.


If you need 64 bits of return value, then it's not going to work. Even
with the existing patches, reusing cqe->flags isn't going to fly, as
it would conflict with e.g. doing zone append writes with automatic
buffer selection.


Buffer selection is for reads/recv kind of requests, but appends
are writes. In theory they can co-exist using cqe->flags.


Yeah, good point; since it's just writes, it doesn't matter. But the other
point still stands: it could potentially conflict with other flags, but
I guess only to the extent where both flags would need extra storage in
->flags. So not a huge concern imho.


Very good point Pavel!

If co-existing with the current flags is an option, I'll explore this
for the next version.

Thanks Jens and Pavel for the time and ideas!

Javier




Re: [PATCH 3/3] io_uring: add support for zone-append

2020-06-18 Thread javier.g...@samsung.com

On 18.06.2020 08:47, Damien Le Moal wrote:

On 2020/06/18 17:35, javier.g...@samsung.com wrote:

On 18.06.2020 07:39, Damien Le Moal wrote:

On 2020/06/18 2:27, Kanchan Joshi wrote:

From: Selvakumar S 

Introduce three new opcodes for zone-append -

   IORING_OP_ZONE_APPEND : non-vectored, similar to IORING_OP_WRITE
   IORING_OP_ZONE_APPENDV: vectored, similar to IORING_OP_WRITEV
   IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers

Repurpose cqe->flags to return zone-relative offset.

Signed-off-by: SelvaKumar S 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier Gonzalez 
---
 fs/io_uring.c | 72 +--
 include/uapi/linux/io_uring.h |  8 -
 2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
unsigned long   fsize;
u64 user_data;
u32 result;
+#ifdef CONFIG_BLK_DEV_ZONED
+   /* zone-relative offset for append, in bytes */
+   u32 append_offset;


this can overflow. u64 is needed.


We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.

We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.

Do you believe this is a better approach?


The problem is that zone sizes are 32 bits in the kernel, as a number of
sectors. So any device that has a zone size smaller than or equal to 2^31
512B sectors can be accepted. Using a zone relative offset in bytes for
returning the zone append result is OK-ish, but to match the kernel supported
range of possible zone sizes, you need 31+9 bits... 32 does not cut it.


Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.



Since you need a 64-bit sized result, I would also prefer that you drop the
zone relative offset as a result and return the absolute offset instead. That
makes life easier for applications, since zone append requests also must use
absolute offsets for the zone start. An absolute offset as a result becomes
consistent with that and with all other read/write system calls, which all
use absolute offsets (lseek() is the only one I know of that can use a
relative offset, but that is not an IO system call).


Agree. Using relative offsets was a product of reusing the existing u32.
If we move to u64, there is no need for an extra transformation.

Thanks Damien!
Javier


