Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-02-22 Thread Kanchan Joshi
On Mon, Feb 22, 2021 at 8:03 PM Jens Axboe  wrote:
>
> On 2/22/21 6:42 AM, Kanchan Joshi wrote:
> > On Thu, Jan 28, 2021 at 10:54 PM Jens Axboe  wrote:
> >>
> >> On 1/28/21 10:13 AM, Kanchan Joshi wrote:
> >>> On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe  wrote:
> 
>  On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> > On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov  
> > wrote:
> >>
> >> On 27/01/2021 15:42, Pavel Begunkov wrote:
> >>> On 27/01/2021 15:00, Kanchan Joshi wrote:
>  This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>  The purpose of the RFC is to get feedback and optimize the path.
> 
>  At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
>  presented to user-space applications. Like a regular ioctl, it takes an
>  ioctl opcode and an optional argument (the ioctl-specific input/output
>  parameter). Unlike a regular ioctl, it is made to skip the block layer
>  and reach the underlying driver directly (nvme in the case of this
>  patchset). This path between io_uring and nvme goes via a newly
>  introduced block-device operation, "async_ioctl". This operation
>  expects io_uring to supply a callback function which can be used to
>  report completion at a later stage.
> 
>  For a regular ioctl, the NVMe driver submits the command to the device
>  and the submitter (task) is made to wait until the completion arrives.
>  For an async ioctl, completion is decoupled from submission. The
>  submitter goes back to its business without waiting for the nvme
>  completion. When the nvme completion arrives, it informs io_uring via
>  the registered completion-handler. But some ioctls may require updating
>  certain ioctl-specific fields which can be accessed only in the context
>  of the submitter task. For that reason, the NVMe driver uses the
>  task-work infrastructure for that ioctl-specific update. Since
>  task-work is not exported, it cannot be referenced when nvme is
>  compiled as a module. Therefore, one of the patches exports the
>  task-work API.
> 
>  An example of usage (pseudo-code) follows.
>  The actual nvme-cli source, modified to issue all ioctls via this
>  opcode, is available at:
>  https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >>>
> >>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >>>
> >>> Looks like a good time to bring that branch/discussion back.
> >>
> >> a bit more context:
> >> https://github.com/axboe/liburing/issues/270
> >
> > Thanks, it looked good. The key differences (compared to the
> > uring patch that I posted) seem to be:
> > 1. using a file operation instead of a block-dev operation.
> 
>  Right, it's meant to span wider than just block devices.
> 
> > 2. repurposing the sqe memory for the ioctl-cmd. If an application
> > issues an ioctl with <= 40 bytes of cmd, it does not have to allocate
> > the ioctl-cmd separately. That's nifty. We still need to support
> > passing a larger cmd (e.g. the nvme-passthru ioctl takes 72 bytes), but
> > that shouldn't get too difficult, I suppose.
> 
>  It's actually 48 bytes in the as-posted version, and I've bumped it to
>  56 bytes in the latest branch. So not quite enough for everything,
>  nothing ever will be, but should work for a lot of cases without
>  requiring per-command allocations just for the actual command.
> >>>
> >>> Agreed. But if I got it right, you are open to supporting both an
> >>> in-the-sqe command (<= 56 bytes) and an out-of-sqe command (> 56 bytes)
> >>> with this interface.
> >>> The driver processing the ioctl can fetch the cmd from user-space in
> >>> one case (as it does now), and skip that in the other.
> >>
> >> Your out-of-sqe command would be none of io_uring's business, outside of
> >> the fact that we'd need to ensure it's stable if we need to postpone
> >> it. So yes, that would be fine, it just means your actual command is
> >> passed in as a pointer, and you would be responsible for copying it
> >> in for execution.
> >>
> >> We're going to need something to handle postponing, and something
> >> for ensuring that e.g. cancelations free the allocated memory.
> >
> > I have a few doubts about allocation/postponing. Are you referring to
> > io_uring allocating memory for this case, similar to the way
> > "req->async_data" is managed for a few other opcodes?
> > Or can it (i.e. the larger cmd) remain a user-space pointer, with the
> > underlying driver fetching the command in?
> > If the submission context changes (for the sqo/io-wq case), io_uring
> > seemed to apply context-grabbing techniques to make that work.
>
> There are two concerns here:
>
> 1) We need more space than the 48 bytes, which means we need to allocate
>    the command or part of the command when we get the sqe, and of course
>    free that when the command is done.

Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-02-22 Thread Jens Axboe
On 2/22/21 6:42 AM, Kanchan Joshi wrote:
> On Thu, Jan 28, 2021 at 10:54 PM Jens Axboe  wrote:
>>
>> On 1/28/21 10:13 AM, Kanchan Joshi wrote:
>>> On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe  wrote:

 On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov  
> wrote:
>>
>> On 27/01/2021 15:42, Pavel Begunkov wrote:
>>> On 27/01/2021 15:00, Kanchan Joshi wrote:
This RFC patchset adds asynchronous ioctl capability for NVMe devices.
The purpose of the RFC is to get feedback and optimize the path.

At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
presented to user-space applications. Like a regular ioctl, it takes an
ioctl opcode and an optional argument (the ioctl-specific input/output
parameter). Unlike a regular ioctl, it is made to skip the block layer
and reach the underlying driver directly (nvme in the case of this
patchset). This path between io_uring and nvme goes via a newly
introduced block-device operation, "async_ioctl". This operation
expects io_uring to supply a callback function which can be used to
report completion at a later stage.

For a regular ioctl, the NVMe driver submits the command to the device
and the submitter (task) is made to wait until the completion arrives.
For an async ioctl, completion is decoupled from submission. The
submitter goes back to its business without waiting for the nvme
completion. When the nvme completion arrives, it informs io_uring via
the registered completion-handler. But some ioctls may require updating
certain ioctl-specific fields which can be accessed only in the context
of the submitter task. For that reason, the NVMe driver uses the
task-work infrastructure for that ioctl-specific update. Since
task-work is not exported, it cannot be referenced when nvme is
compiled as a module. Therefore, one of the patches exports the
task-work API.

An example of usage (pseudo-code) follows.
The actual nvme-cli source, modified to issue all ioctls via this
opcode, is available at:
https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
>>>
>>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
>>>
>>> Looks like a good time to bring that branch/discussion back.
>>
>> a bit more context:
>> https://github.com/axboe/liburing/issues/270
>
> Thanks, it looked good. The key differences (compared to the
> uring patch that I posted) seem to be:
> 1. using a file operation instead of a block-dev operation.

 Right, it's meant to span wider than just block devices.

> 2. repurposing the sqe memory for the ioctl-cmd. If an application
> issues an ioctl with <= 40 bytes of cmd, it does not have to allocate
> the ioctl-cmd separately. That's nifty. We still need to support
> passing a larger cmd (e.g. the nvme-passthru ioctl takes 72 bytes), but
> that shouldn't get too difficult, I suppose.

 It's actually 48 bytes in the as-posted version, and I've bumped it to
 56 bytes in the latest branch. So not quite enough for everything,
 nothing ever will be, but should work for a lot of cases without
 requiring per-command allocations just for the actual command.
>>>
>>> Agreed. But if I got it right, you are open to supporting both an
>>> in-the-sqe command (<= 56 bytes) and an out-of-sqe command (> 56 bytes)
>>> with this interface.
>>> The driver processing the ioctl can fetch the cmd from user-space in
>>> one case (as it does now), and skip that in the other.
>>
>> Your out-of-sqe command would be none of io_uring's business, outside of
>> the fact that we'd need to ensure it's stable if we need to postpone
>> it. So yes, that would be fine, it just means your actual command is
>> passed in as a pointer, and you would be responsible for copying it
>> in for execution.
>>
>> We're going to need something to handle postponing, and something
>> for ensuring that e.g. cancelations free the allocated memory.
> 
> I have a few doubts about allocation/postponing. Are you referring to
> io_uring allocating memory for this case, similar to the way
> "req->async_data" is managed for a few other opcodes?
> Or can it (i.e. the larger cmd) remain a user-space pointer, with the
> underlying driver fetching the command in?
> If the submission context changes (for the sqo/io-wq case), io_uring
> seemed to apply context-grabbing techniques to make that work.

There are two concerns here:

1) We need more space than the 48 bytes, which means we need to allocate
   the command or part of the command when we get the sqe, and of course
   free that when the command is done.

2) When io_uring_enter() returns and has consumed N commands, the state
   for those commands must be stable. Hence if you're passing in a
   pointer to a struct, that struct will have been read and stored
   safely. This prevents 
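
To make these two concerns concrete, here is a minimal kernel-style
sketch of copying an oversized command into kernel-owned memory at prep
time, so the sqe-referenced state is stable once io_uring_enter()
returns. Every name in it (io_ioctl_cmd, io_ioctl_prep, pt_cmd, the
fixed 72-byte length) is an illustrative assumption, not the posted
patches.

#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/io_uring.h>

struct io_ioctl_cmd {
	void *pt_cmd;	/* kernel copy of the large ioctl command */
};

/* Hypothetical prep hook, called while the sqe is still valid. */
static int io_ioctl_prep(struct io_ioctl_cmd *ic,
			 const struct io_uring_sqe *sqe)
{
	/* e.g. sizeof(struct nvme_passthru_cmd) */
	const size_t len = 72;

	ic->pt_cmd = kmalloc(len, GFP_KERNEL);
	if (!ic->pt_cmd)
		return -ENOMEM;
	/* Copy before submission returns: after this the application may
	 * reuse its buffer, and postponing/punting the request is safe. */
	if (copy_from_user(ic->pt_cmd, u64_to_user_ptr(READ_ONCE(sqe->addr)),
			   len)) {
		kfree(ic->pt_cmd);	/* error/cancelation paths must free */
		return -EFAULT;
	}
	return 0;
}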

Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-02-22 Thread Kanchan Joshi
On Thu, Jan 28, 2021 at 10:54 PM Jens Axboe  wrote:
>
> On 1/28/21 10:13 AM, Kanchan Joshi wrote:
> > On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe  wrote:
> >>
> >> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> >>> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov  
> >>> wrote:
> 
>  On 27/01/2021 15:42, Pavel Begunkov wrote:
> > On 27/01/2021 15:00, Kanchan Joshi wrote:
> >> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> >> The purpose of the RFC is to get feedback and optimize the path.
> >>
> >> At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
> >> presented to user-space applications. Like a regular ioctl, it takes an
> >> ioctl opcode and an optional argument (the ioctl-specific input/output
> >> parameter). Unlike a regular ioctl, it is made to skip the block layer
> >> and reach the underlying driver directly (nvme in the case of this
> >> patchset). This path between io_uring and nvme goes via a newly
> >> introduced block-device operation, "async_ioctl". This operation
> >> expects io_uring to supply a callback function which can be used to
> >> report completion at a later stage.
> >>
> >> For a regular ioctl, the NVMe driver submits the command to the device
> >> and the submitter (task) is made to wait until the completion arrives.
> >> For an async ioctl, completion is decoupled from submission. The
> >> submitter goes back to its business without waiting for the nvme
> >> completion. When the nvme completion arrives, it informs io_uring via
> >> the registered completion-handler. But some ioctls may require updating
> >> certain ioctl-specific fields which can be accessed only in the context
> >> of the submitter task. For that reason, the NVMe driver uses the
> >> task-work infrastructure for that ioctl-specific update. Since
> >> task-work is not exported, it cannot be referenced when nvme is
> >> compiled as a module. Therefore, one of the patches exports the
> >> task-work API.
> >>
> >> An example of usage (pseudo-code) follows.
> >> The actual nvme-cli source, modified to issue all ioctls via this
> >> opcode, is available at:
> >> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >
> > see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >
> > Looks like a good time to bring that branch/discussion back.
> 
>  a bit more context:
>  https://github.com/axboe/liburing/issues/270
> >>>
> >>> Thanks, it looked good. The key differences (compared to the
> >>> uring patch that I posted) seem to be:
> >>> 1. using a file operation instead of a block-dev operation.
> >>
> >> Right, it's meant to span wider than just block devices.
> >>
> >>> 2. repurposing the sqe memory for the ioctl-cmd. If an application
> >>> issues an ioctl with <= 40 bytes of cmd, it does not have to allocate
> >>> the ioctl-cmd separately. That's nifty. We still need to support
> >>> passing a larger cmd (e.g. the nvme-passthru ioctl takes 72 bytes), but
> >>> that shouldn't get too difficult, I suppose.
> >>
> >> It's actually 48 bytes in the as-posted version, and I've bumped it to
> >> 56 bytes in the latest branch. So not quite enough for everything,
> >> nothing ever will be, but should work for a lot of cases without
> >> requiring per-command allocations just for the actual command.
> >
> > Agreed. But if I got it right, you are open to supporting both an
> > in-the-sqe command (<= 56 bytes) and an out-of-sqe command (> 56 bytes)
> > with this interface.
> > The driver processing the ioctl can fetch the cmd from user-space in
> > one case (as it does now), and skip that in the other.
>
> Your out-of-sqe command would be none of io_uring's business, outside of
> the fact that we'd need to ensure it's stable if we need to postpone
> it. So yes, that would be fine, it just means your actual command is
> passed in as a pointer, and you would be responsible for copying it
> in for execution.
>
> We're going to need something to handle postponing, and something
> for ensuring that e.g. cancelations free the allocated memory.

I have a few doubts about allocation/postponing. Are you referring to
io_uring allocating memory for this case, similar to the way
"req->async_data" is managed for a few other opcodes?
Or can it (i.e. the larger cmd) remain a user-space pointer, with the
underlying driver fetching the command in?
If the submission context changes (for the sqo/io-wq case), io_uring
seemed to apply context-grabbing techniques to make that work.

> >>> And for some ioctls, the driver may still need to use task-work to
> >>> update the user-space pointers (embedded in the uring/ioctl cmd)
> >>> during completion.
> >>>
> >>> @Jens - will it be fine if I start looking at plumbing the nvme part
> >>> of this series on top of your work?
> >>
> >> Sure, go ahead. Just beware that things are still changing, so you might
> >> have to adapt it a few times. It's still early days, but I do think
> >> that's the way forward in providing controlled access to what is
> >> basically async ioctls.

Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-28 Thread Kanchan Joshi
On Thu, Jan 28, 2021 at 8:20 PM Jens Axboe  wrote:
>
> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> > And for some ioctls, the driver may still need to use task-work to
> > update the user-space pointers (embedded in the uring/ioctl cmd)
> > during completion.
>
> For this use case, we should ensure that just io_uring handles this
> part. It's already got everything set up for it, and I'd rather avoid
> having drivers touch any of those parts. This could be done by having
> an io_uring helper a la:
>
> io_uring_cmd_complete_in_task(cmd, handler);
>
> which takes care of the nitty gritty details.

Ah right. With that, I can do away with exporting task-work.
The NVMe completion path can invoke (depending on the ioctl) this
uring helper with a handler that does the ioctl-specific update in task
context.



-- 
Kanchan
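
A rough sketch of how that could look from the driver side. The
io_uring_cmd handle and the helper's signature below are assumptions
based on this exchange, not a merged API:

/* Assumed shapes, kernel-style C: io_uring owns the handle and the
 * helper, so the driver never touches task-work directly. */
struct io_uring_cmd;
void io_uring_cmd_complete_in_task(struct io_uring_cmd *cmd,
				   void (*handler)(struct io_uring_cmd *));

/* Runs later, in the submitting task's context. */
static void nvme_pt_task_cb(struct io_uring_cmd *cmd)
{
	/* Safe to update ioctl-specific user-space fields here
	 * (put_user()/copy_to_user()), then post the completion. */
}

/* nvme completion path (interrupt context): just hand off. */
static void nvme_pt_complete(struct io_uring_cmd *cmd)
{
	io_uring_cmd_complete_in_task(cmd, nvme_pt_task_cb);
}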


Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-28 Thread Jens Axboe
On 1/28/21 10:13 AM, Kanchan Joshi wrote:
> On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe  wrote:
>>
>> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
>>> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov  
>>> wrote:

 On 27/01/2021 15:42, Pavel Begunkov wrote:
> On 27/01/2021 15:00, Kanchan Joshi wrote:
>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>> The purpose of the RFC is to get feedback and optimize the path.
>>
>> At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
>> presented to user-space applications. Like a regular ioctl, it takes an
>> ioctl opcode and an optional argument (the ioctl-specific input/output
>> parameter). Unlike a regular ioctl, it is made to skip the block layer
>> and reach the underlying driver directly (nvme in the case of this
>> patchset). This path between io_uring and nvme goes via a newly
>> introduced block-device operation, "async_ioctl". This operation
>> expects io_uring to supply a callback function which can be used to
>> report completion at a later stage.
>>
>> For a regular ioctl, the NVMe driver submits the command to the device
>> and the submitter (task) is made to wait until the completion arrives.
>> For an async ioctl, completion is decoupled from submission. The
>> submitter goes back to its business without waiting for the nvme
>> completion. When the nvme completion arrives, it informs io_uring via
>> the registered completion-handler. But some ioctls may require updating
>> certain ioctl-specific fields which can be accessed only in the context
>> of the submitter task. For that reason, the NVMe driver uses the
>> task-work infrastructure for that ioctl-specific update. Since
>> task-work is not exported, it cannot be referenced when nvme is
>> compiled as a module. Therefore, one of the patches exports the
>> task-work API.
>>
>> An example of usage (pseudo-code) follows.
>> The actual nvme-cli source, modified to issue all ioctls via this
>> opcode, is available at:
>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
>
> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
>
> Looks like a good time to bring that branch/discussion back.

 a bit more context:
 https://github.com/axboe/liburing/issues/270
>>>
>>> Thanks, it looked good. The key differences (compared to the
>>> uring patch that I posted) seem to be:
>>> 1. using a file operation instead of a block-dev operation.
>>
>> Right, it's meant to span wider than just block devices.
>>
>>> 2. repurposing the sqe memory for the ioctl-cmd. If an application
>>> issues an ioctl with <= 40 bytes of cmd, it does not have to allocate
>>> the ioctl-cmd separately. That's nifty. We still need to support
>>> passing a larger cmd (e.g. the nvme-passthru ioctl takes 72 bytes), but
>>> that shouldn't get too difficult, I suppose.
>>
>> It's actually 48 bytes in the as-posted version, and I've bumped it to
>> 56 bytes in the latest branch. So not quite enough for everything,
>> nothing ever will be, but should work for a lot of cases without
>> requiring per-command allocations just for the actual command.
> 
> Agreed. But if I got it right, you are open to supporting both an
> in-the-sqe command (<= 56 bytes) and an out-of-sqe command (> 56 bytes)
> with this interface.
> The driver processing the ioctl can fetch the cmd from user-space in
> one case (as it does now), and skip that in the other.

Your out-of-sqe command would be none of io_uring's business, outside of
the fact that we'd need to ensure it's stable if we need to postpone
it. So yes, that would be fine, it just means your actual command is
passed in as a pointer, and you would be responsible for copying it
in for execution.

We're going to need something to handle postponing, and something
for ensuring that e.g. cancelations free the allocated memory.

>>> And for some ioctls, the driver may still need to use task-work to
>>> update the user-space pointers (embedded in the uring/ioctl cmd)
>>> during completion.
>>>
>>> @Jens - will it be fine if I start looking at plumbing the nvme part
>>> of this series on top of your work?
>>
>> Sure, go ahead. Just beware that things are still changing, so you might
>> have to adapt it a few times. It's still early days, but I do think
>> that's the way forward in providing controlled access to what is
>> basically async ioctls.
> 
> Sounds good, I will start with the latest branch that you posted. Thanks.

It's io_uring-fops.v2 for now, use that one.

-- 
Jens Axboe



Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-28 Thread Kanchan Joshi
On Thu, Jan 28, 2021 at 8:08 PM Jens Axboe  wrote:
>
> On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> > On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov  
> > wrote:
> >>
> >> On 27/01/2021 15:42, Pavel Begunkov wrote:
> >>> On 27/01/2021 15:00, Kanchan Joshi wrote:
>  This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>  The purpose of the RFC is to get feedback and optimize the path.
> 
>  At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
>  presented to user-space applications. Like a regular ioctl, it takes an
>  ioctl opcode and an optional argument (the ioctl-specific input/output
>  parameter). Unlike a regular ioctl, it is made to skip the block layer
>  and reach the underlying driver directly (nvme in the case of this
>  patchset). This path between io_uring and nvme goes via a newly
>  introduced block-device operation, "async_ioctl". This operation
>  expects io_uring to supply a callback function which can be used to
>  report completion at a later stage.
> 
>  For a regular ioctl, the NVMe driver submits the command to the device
>  and the submitter (task) is made to wait until the completion arrives.
>  For an async ioctl, completion is decoupled from submission. The
>  submitter goes back to its business without waiting for the nvme
>  completion. When the nvme completion arrives, it informs io_uring via
>  the registered completion-handler. But some ioctls may require updating
>  certain ioctl-specific fields which can be accessed only in the context
>  of the submitter task. For that reason, the NVMe driver uses the
>  task-work infrastructure for that ioctl-specific update. Since
>  task-work is not exported, it cannot be referenced when nvme is
>  compiled as a module. Therefore, one of the patches exports the
>  task-work API.
> 
>  An example of usage (pseudo-code) follows.
>  The actual nvme-cli source, modified to issue all ioctls via this
>  opcode, is available at:
>  https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >>>
> >>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >>>
> >>> Looks like a good time to bring that branch/discussion back.
> >>
> >> a bit more context:
> >> https://github.com/axboe/liburing/issues/270
> >
> > Thanks, it looked good. The key differences (compared to the
> > uring patch that I posted) seem to be:
> > 1. using a file operation instead of a block-dev operation.
>
> Right, it's meant to span wider than just block devices.
>
> > 2. repurposing the sqe memory for the ioctl-cmd. If an application
> > issues an ioctl with <= 40 bytes of cmd, it does not have to allocate
> > the ioctl-cmd separately. That's nifty. We still need to support
> > passing a larger cmd (e.g. the nvme-passthru ioctl takes 72 bytes), but
> > that shouldn't get too difficult, I suppose.
>
> It's actually 48 bytes in the as-posted version, and I've bumped it to
> 56 bytes in the latest branch. So not quite enough for everything,
> nothing ever will be, but should work for a lot of cases without
> requiring per-command allocations just for the actual command.

Agreed. But if I got it right, you are open to supporting both an
in-the-sqe command (<= 56 bytes) and an out-of-sqe command (> 56 bytes)
with this interface.
The driver processing the ioctl can fetch the cmd from user-space in one
case (as it does now), and skip that in the other (see the sketch after
this message).

> > And for some ioctls, the driver may still need to use task-work to
> > update the user-space pointers (embedded in the uring/ioctl cmd)
> > during completion.
> >
> > @Jens - will it be fine if I start looking at plumbing the nvme part
> > of this series on top of your work?
>
> Sure, go ahead. Just beware that things are still changing, so you might
> have to adapt it a few times. It's still early days, but I do think
> that's the way forward in providing controlled access to what is
> basically async ioctls.

Sounds good, I will start with the latest branch that you posted. Thanks.

-- 
Kanchan
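
A minimal sketch of the in-the-sqe/out-of-sqe split described above.
The 56-byte figure comes from the thread; the field and function names
(sqe_pdu, ucmd, fetch_ioctl_cmd) are illustrative assumptions:

#include <linux/string.h>
#include <linux/uaccess.h>

#define INLINE_CMD_MAX	56	/* spare sqe bytes in the later branch */

/* Illustrative driver-side fetch helper, kernel-style C. */
static int fetch_ioctl_cmd(const void *sqe_pdu, const void __user *ucmd,
			   void *dst, size_t len)
{
	if (len <= INLINE_CMD_MAX) {
		/* In-the-sqe: the command already sits in sqe memory,
		 * no user-space fetch needed. */
		memcpy(dst, sqe_pdu, len);
		return 0;
	}
	/* Out-of-sqe: the sqe carried only a user pointer; the driver
	 * fetches the command itself, as a classic ioctl would. */
	if (copy_from_user(dst, ucmd, len))
		return -EFAULT;
	return 0;
}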


Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-28 Thread Jens Axboe
On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> And for some ioctls, the driver may still need to use task-work to
> update the user-space pointers (embedded in the uring/ioctl cmd)
> during completion.

For this use case, we should ensure that just io_uring handles this
part. It's already got everything set up for it, and I'd rather avoid
having drivers touch any of those parts. This could be done by having
an io_uring helper a la:

io_uring_cmd_complete_in_task(cmd, handler);

which takes care of the nitty gritty details.

-- 
Jens Axboe



Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-28 Thread Jens Axboe
On 1/28/21 5:04 AM, Kanchan Joshi wrote:
> On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov  wrote:
>>
>> On 27/01/2021 15:42, Pavel Begunkov wrote:
>>> On 27/01/2021 15:00, Kanchan Joshi wrote:
This RFC patchset adds asynchronous ioctl capability for NVMe devices.
The purpose of the RFC is to get feedback and optimize the path.

At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
presented to user-space applications. Like a regular ioctl, it takes an
ioctl opcode and an optional argument (the ioctl-specific input/output
parameter). Unlike a regular ioctl, it is made to skip the block layer
and reach the underlying driver directly (nvme in the case of this
patchset). This path between io_uring and nvme goes via a newly
introduced block-device operation, "async_ioctl". This operation
expects io_uring to supply a callback function which can be used to
report completion at a later stage.

For a regular ioctl, the NVMe driver submits the command to the device
and the submitter (task) is made to wait until the completion arrives.
For an async ioctl, completion is decoupled from submission. The
submitter goes back to its business without waiting for the nvme
completion. When the nvme completion arrives, it informs io_uring via
the registered completion-handler. But some ioctls may require updating
certain ioctl-specific fields which can be accessed only in the context
of the submitter task. For that reason, the NVMe driver uses the
task-work infrastructure for that ioctl-specific update. Since
task-work is not exported, it cannot be referenced when nvme is
compiled as a module. Therefore, one of the patches exports the
task-work API.

An example of usage (pseudo-code) follows.
The actual nvme-cli source, modified to issue all ioctls via this
opcode, is available at:
https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
>>>
>>> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
>>>
>>> Looks like a good time to bring that branch/discussion back.
>>
>> a bit more context:
>> https://github.com/axboe/liburing/issues/270
> 
> Thanks, it looked good. The key differences (compared to the
> uring patch that I posted) seem to be:
> 1. using a file operation instead of a block-dev operation.

Right, it's meant to span wider than just block devices.

> 2. repurposing the sqe memory for the ioctl-cmd. If an application
> issues an ioctl with <= 40 bytes of cmd, it does not have to allocate
> the ioctl-cmd separately. That's nifty. We still need to support
> passing a larger cmd (e.g. the nvme-passthru ioctl takes 72 bytes), but
> that shouldn't get too difficult, I suppose.

It's actually 48 bytes in the as-posted version, and I've bumped it to
56 bytes in the latest branch. So not quite enough for everything,
nothing ever will be, but should work for a lot of cases without
requiring per-command allocations just for the actual command.

> And for some ioctls, the driver may still need to use task-work to
> update the user-space pointers (embedded in the uring/ioctl cmd)
> during completion.
>
> @Jens - will it be fine if I start looking at plumbing the nvme part
> of this series on top of your work?

Sure, go ahead. Just beware that things are still changing, so you might
have to adapt it a few times. It's still early days, but I do think
that's the way forward in providing controlled access to what is
basically async ioctls.

-- 
Jens Axboe



Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-28 Thread Kanchan Joshi
On Wed, Jan 27, 2021 at 9:32 PM Pavel Begunkov  wrote:
>
> On 27/01/2021 15:42, Pavel Begunkov wrote:
> > On 27/01/2021 15:00, Kanchan Joshi wrote:
> >> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> >> The purpose of the RFC is to get feedback and optimize the path.
> >>
> >> At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
> >> presented to user-space applications. Like a regular ioctl, it takes an
> >> ioctl opcode and an optional argument (the ioctl-specific input/output
> >> parameter). Unlike a regular ioctl, it is made to skip the block layer
> >> and reach the underlying driver directly (nvme in the case of this
> >> patchset). This path between io_uring and nvme goes via a newly
> >> introduced block-device operation, "async_ioctl". This operation
> >> expects io_uring to supply a callback function which can be used to
> >> report completion at a later stage.
> >>
> >> For a regular ioctl, the NVMe driver submits the command to the device
> >> and the submitter (task) is made to wait until the completion arrives.
> >> For an async ioctl, completion is decoupled from submission. The
> >> submitter goes back to its business without waiting for the nvme
> >> completion. When the nvme completion arrives, it informs io_uring via
> >> the registered completion-handler. But some ioctls may require updating
> >> certain ioctl-specific fields which can be accessed only in the context
> >> of the submitter task. For that reason, the NVMe driver uses the
> >> task-work infrastructure for that ioctl-specific update. Since
> >> task-work is not exported, it cannot be referenced when nvme is
> >> compiled as a module. Therefore, one of the patches exports the
> >> task-work API.
> >>
> >> An example of usage (pseudo-code) follows.
> >> The actual nvme-cli source, modified to issue all ioctls via this
> >> opcode, is available at:
> >> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> >
> > see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> >
> > Looks like a good time to bring that branch/discussion back.
>
> a bit more context:
> https://github.com/axboe/liburing/issues/270

Thanks, it looked good. The key differences (compared to the
uring patch that I posted) seem to be:
1. using a file operation instead of a block-dev operation.
2. repurposing the sqe memory for the ioctl-cmd. If an application
issues an ioctl with <= 40 bytes of cmd, it does not have to allocate
the ioctl-cmd separately. That's nifty. We still need to support
passing a larger cmd (e.g. the nvme-passthru ioctl takes 72 bytes), but
that shouldn't get too difficult, I suppose.

And for some ioctls, the driver may still need to use task-work to
update the user-space pointers (embedded in the uring/ioctl cmd) during
completion.

@Jens - will it be fine if I start looking at plumbing the nvme part of
this series on top of your work?


Thanks,


Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-27 Thread Pavel Begunkov
On 27/01/2021 15:42, Pavel Begunkov wrote:
> On 27/01/2021 15:00, Kanchan Joshi wrote:
>> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
>> The purpose of the RFC is to get feedback and optimize the path.
>>
>> At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
>> presented to user-space applications. Like a regular ioctl, it takes an
>> ioctl opcode and an optional argument (the ioctl-specific input/output
>> parameter). Unlike a regular ioctl, it is made to skip the block layer
>> and reach the underlying driver directly (nvme in the case of this
>> patchset). This path between io_uring and nvme goes via a newly
>> introduced block-device operation, "async_ioctl". This operation
>> expects io_uring to supply a callback function which can be used to
>> report completion at a later stage.
>>
>> For a regular ioctl, the NVMe driver submits the command to the device
>> and the submitter (task) is made to wait until the completion arrives.
>> For an async ioctl, completion is decoupled from submission. The
>> submitter goes back to its business without waiting for the nvme
>> completion. When the nvme completion arrives, it informs io_uring via
>> the registered completion-handler. But some ioctls may require updating
>> certain ioctl-specific fields which can be accessed only in the context
>> of the submitter task. For that reason, the NVMe driver uses the
>> task-work infrastructure for that ioctl-specific update. Since
>> task-work is not exported, it cannot be referenced when nvme is
>> compiled as a module. Therefore, one of the patches exports the
>> task-work API.
>>
>> An example of usage (pseudo-code) follows.
>> The actual nvme-cli source, modified to issue all ioctls via this
>> opcode, is available at:
>> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5
> 
> see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops
> 
> Looks like a good time to bring that branch/discussion back.

a bit more context:
https://github.com/axboe/liburing/issues/270

-- 
Pavel Begunkov
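
For reference, the cover letter's "async_ioctl" block-device operation
would take roughly the following shape; the prototype below is an
assumption inferred from the description, not copied from the posted
patches:

/* Kernel-style C sketch. In the real tree this would extend
 * struct block_device_operations in <linux/blkdev.h>. */
typedef void (*async_ioctl_done_t)(void *ctx, long ret);

struct block_device_operations_sketch {
	/* today's synchronous path */
	int (*ioctl)(struct block_device *bdev, fmode_t mode,
		     unsigned int cmd, unsigned long arg);
	/* async variant: submit and return immediately; the driver
	 * invokes cb(ctx, result) when the device completes the command */
	int (*async_ioctl)(struct block_device *bdev, fmode_t mode,
			   unsigned int cmd, unsigned long arg,
			   async_ioctl_done_t cb, void *ctx);
};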


Re: [RFC PATCH 0/4] Asynchronous passthrough ioctl

2021-01-27 Thread Pavel Begunkov
On 27/01/2021 15:00, Kanchan Joshi wrote:
> This RFC patchset adds asynchronous ioctl capability for NVMe devices.
> The purpose of the RFC is to get feedback and optimize the path.
> 
> At the uppermost io_uring layer, a new opcode IORING_OP_IOCTL_PT is
> presented to user-space applications. Like a regular ioctl, it takes an
> ioctl opcode and an optional argument (the ioctl-specific input/output
> parameter). Unlike a regular ioctl, it is made to skip the block layer
> and reach the underlying driver directly (nvme in the case of this
> patchset). This path between io_uring and nvme goes via a newly
> introduced block-device operation, "async_ioctl". This operation
> expects io_uring to supply a callback function which can be used to
> report completion at a later stage.
> 
> For a regular ioctl, the NVMe driver submits the command to the device
> and the submitter (task) is made to wait until the completion arrives.
> For an async ioctl, completion is decoupled from submission. The
> submitter goes back to its business without waiting for the nvme
> completion. When the nvme completion arrives, it informs io_uring via
> the registered completion-handler. But some ioctls may require updating
> certain ioctl-specific fields which can be accessed only in the context
> of the submitter task. For that reason, the NVMe driver uses the
> task-work infrastructure for that ioctl-specific update. Since
> task-work is not exported, it cannot be referenced when nvme is
> compiled as a module. Therefore, one of the patches exports the
> task-work API.
> 
> An example of usage (pseudo-code) follows.
> The actual nvme-cli source, modified to issue all ioctls via this
> opcode, is available at:
> https://github.com/joshkan/nvme-cli/commit/a008a733f24ab5593e7874cfbc69ee04e88068c5

see https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops

Looks like a good time to bring that branch/discussion back.

> 
> With a regular ioctl:
> int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
>  struct nvme_passthru_cmd *cmd)
> {
>   return ioctl(fd, ioctl_cmd, cmd);
> }
> 
> With the uring passthru ioctl:
> int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
>  struct nvme_passthru_cmd *cmd)
> {
>   return uring_ioctl(fd, ioctl_cmd, cmd);
> }
> 
> int uring_ioctl(int fd, unsigned long cmd, u64 arg)
> {
>   sqe = io_uring_get_sqe(ring);
> 
>   /* prepare sqe */
>   sqe->fd = fd;
>   sqe->opcode = IORING_OP_IOCTL_PT;
>   sqe->ioctl_cmd = cmd;
>   sqe->ioctl_arg = arg;
> 
>   /* submit sqe */
>   io_uring_submit(ring);
> 
>   /* reap completion and obtain result */
>   io_uring_wait_cqe(ring, &cqe);
>   ret = cqe->res;
>   printf("ioctl result = %d\n", ret);
>   /* mark the cqe as consumed */
>   io_uring_cqe_seen(ring, cqe);
>   return ret;
> }
> 
> Kanchan Joshi (4):
>   block: introduce async ioctl operation
>   kernel: export task_work_add
>   nvme: add async ioctl support
>   io_uring: add async passthrough ioctl support
> 
>  drivers/nvme/host/core.c  | 347 +++---
>  fs/io_uring.c |  77 
>  include/linux/blkdev.h|  12 ++
>  include/uapi/linux/io_uring.h |   7 +-
>  kernel/task_work.c|   2 +-
>  5 files changed, 376 insertions(+), 69 deletions(-)
> 

-- 
Pavel Begunkov
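
Since several messages above lean on the task-work mechanism (patch 2
of the series exports it), here is a minimal sketch of the usage
pattern with the in-tree task_work API of that era; the nvme_pt_ctx
type and the driver glue around it are assumptions:

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/task_work.h>

/* Illustrative per-ioctl context carried through the completion. */
struct nvme_pt_ctx {
	struct callback_head twork;
	/* ...result fields to copy back to user space... */
};

/* Runs in the submitting task's context on its way back to user mode,
 * where put_user()/copy_to_user() on the saved pointers is safe. */
static void nvme_pt_twork_fn(struct callback_head *head)
{
	struct nvme_pt_ctx *ctx = container_of(head, struct nvme_pt_ctx, twork);

	/* ...update the ioctl-specific user-space fields... */
	kfree(ctx);
}

/* Device completion side: queue the update to the submitter. */
static void nvme_pt_queue_twork(struct task_struct *submitter,
				struct nvme_pt_ctx *ctx)
{
	init_task_work(&ctx->twork, nvme_pt_twork_fn);
	/* TWA_SIGNAL kicks the task so the work runs promptly. */
	if (task_work_add(submitter, &ctx->twork, TWA_SIGNAL))
		kfree(ctx);	/* task is exiting; clean up here instead */
}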