Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

2020-06-24 Thread Ignat Korchagin
On Wed, Jun 24, 2020 at 5:24 PM Eric Biggers  wrote:
>
> On Wed, Jun 24, 2020 at 09:24:07AM +0100, Ignat Korchagin wrote:
> > On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers  wrote:
> > >
> > > On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > > > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. 
> > > > This is
> > > > especially visible on busy systems with many processes/threads. 
> > > > Moreover, most
> > > > Crypto API implementaions are async, that is they offload crypto 
> > > > operations on
> > > > their own, so this dm-crypt offloading is excessive.
> > >
> > > This really should say "some Crypto API implementations are async" 
> > > instead of
> > > "most Crypto API implementations are async".
> >
> > The most accurate would probably be: most hardware-accelerated Crypto
> > API implementations are async
> >
> > > Notably, the AES-NI implementation of AES-XTS is synchronous if you call 
> > > it in a
> > > context where SIMD instructions are usable.  It's only asynchronous when 
> > > SIMD is
> > > not usable.  (This seems to have been missed in your blog post.)
> >
> > No, it was not. This is exactly why we made xts-proxy Crypto API
> > module as a second patch. But it seems now it does not make a big
> > difference if a used Crypto API implementation is synchronous as well
> > (based on some benchmarks outlined in the cover letter to this patch).
> > I think the v2 of this patch will not require a synchronous Crypto
> > API. This is probably a right thing to do, as the "inline" flag should
> > control the way how dm-crypt itself handles requests, not how Crypto
> > API handles requests. If a user wants to ensure a particular
> > synchronous Crypto API implementation, they can already reconfigure
> > dm-crypt and specify the implementation with a "capi:" prefix in the
> > the dm table description.
>
> I think you're missing the point.  Although xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set, the actual implementation processes the request
> synchronously if SIMD instructions are currently usable.  That's always the 
> case
> in dm-crypt, as far as I can tell.  This algorithm has the ASYNC flag only
> because it's not synchronous when called in hardIRQ context.
>
> That's why your "xts-proxy" doesn't make a difference, and why it's misleading
> to suggest that the crypto API is doing its own queueing when you're primarily
> talking about xts-aes-aesni.  The crypto API definitely can do its own 
> queueing,
> mainly with hardware drivers.  But it doesn't in this common and relevant 
> case.

I think we're talking about the same things but from different points
of view. I would like to clarify that the whole post and this change
does not have the intention to focus on aesni (or any x86-specific
crypto optimizations). In fact it is quite the opposite: we want to
optimize the generic dm-crypt regardless of which crypto is used
(that's why I just used a NULL cipher in the cover letter). We also
have some arm64 machines [1] and I bet they would benefit here as
well. The important point my post tries to make is that the original
workqueue offloading in dm-crypt was added because the Crypto API was
synchronous back in the day and, exactly as you say, you may not be
able to use some hw-accelerated crypto in hard IRQ context as well as
doing non-hw crypto in hard IRQ context is a bad idea. Now, most
Crypto API are smart enough to figure out on their own if they should
process the request inline or offload it to a workqueue, so the
workarounds in the dm-crypt itself most likely are not needed. Though,
the generic Crypto API "cipher walk" function does refuse to "walk"
the buffers in hard IRQ context, so the "tasklet" functionality is
still required.

But from the dm-crypt perspective - it should not take into account if
a particular xts-aes-aesni implementation is MOSTLY synchronous -
those are details of the implementation of this particular cipher
dm-crypt has no visibility into. So it would be right to say in my
opinion if the cipher has the CRYPTO_ALG_ASYNC flag set - it can
offload the crypto request to a workqueue at any time. How often does
it do it - that's another story and probably should be reviewed
elsewhere, if it does it too often.

Ignat

[1]: https://blog.cloudflare.com/arm-takes-wing/

> - Eric


Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

2020-06-24 Thread Eric Biggers
On Wed, Jun 24, 2020 at 09:24:07AM +0100, Ignat Korchagin wrote:
> On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers  wrote:
> >
> > On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. 
> > > This is
> > > especially visible on busy systems with many processes/threads. Moreover, 
> > > most
> > > Crypto API implementaions are async, that is they offload crypto 
> > > operations on
> > > their own, so this dm-crypt offloading is excessive.
> >
> > This really should say "some Crypto API implementations are async" instead 
> > of
> > "most Crypto API implementations are async".
> 
> The most accurate would probably be: most hardware-accelerated Crypto
> API implementations are async
> 
> > Notably, the AES-NI implementation of AES-XTS is synchronous if you call it 
> > in a
> > context where SIMD instructions are usable.  It's only asynchronous when 
> > SIMD is
> > not usable.  (This seems to have been missed in your blog post.)
> 
> No, it was not. This is exactly why we made xts-proxy Crypto API
> module as a second patch. But it seems now it does not make a big
> difference if a used Crypto API implementation is synchronous as well
> (based on some benchmarks outlined in the cover letter to this patch).
> I think the v2 of this patch will not require a synchronous Crypto
> API. This is probably a right thing to do, as the "inline" flag should
> control the way how dm-crypt itself handles requests, not how Crypto
> API handles requests. If a user wants to ensure a particular
> synchronous Crypto API implementation, they can already reconfigure
> dm-crypt and specify the implementation with a "capi:" prefix in the
> the dm table description.

I think you're missing the point.  Although xts-aes-aesni has the
CRYPTO_ALG_ASYNC bit set, the actual implementation processes the request
synchronously if SIMD instructions are currently usable.  That's always the case
in dm-crypt, as far as I can tell.  This algorithm has the ASYNC flag only
because it's not synchronous when called in hardIRQ context.

That's why your "xts-proxy" doesn't make a difference, and why it's misleading
to suggest that the crypto API is doing its own queueing when you're primarily
talking about xts-aes-aesni.  The crypto API definitely can do its own queueing,
mainly with hardware drivers.  But it doesn't in this common and relevant case.

- Eric


Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

2020-06-24 Thread Ignat Korchagin
On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers  wrote:
>
> On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. 
> > This is
> > especially visible on busy systems with many processes/threads. Moreover, 
> > most
> > Crypto API implementaions are async, that is they offload crypto operations 
> > on
> > their own, so this dm-crypt offloading is excessive.
>
> This really should say "some Crypto API implementations are async" instead of
> "most Crypto API implementations are async".

The most accurate would probably be: most hardware-accelerated Crypto
API implementations are async

> Notably, the AES-NI implementation of AES-XTS is synchronous if you call it 
> in a
> context where SIMD instructions are usable.  It's only asynchronous when SIMD 
> is
> not usable.  (This seems to have been missed in your blog post.)

No, it was not. This is exactly why we made xts-proxy Crypto API
module as a second patch. But it seems now it does not make a big
difference if a used Crypto API implementation is synchronous as well
(based on some benchmarks outlined in the cover letter to this patch).
I think the v2 of this patch will not require a synchronous Crypto
API. This is probably a right thing to do, as the "inline" flag should
control the way how dm-crypt itself handles requests, not how Crypto
API handles requests. If a user wants to ensure a particular
synchronous Crypto API implementation, they can already reconfigure
dm-crypt and specify the implementation with a "capi:" prefix in the
the dm table description.

> > This adds a new flag, which directs dm-crypt not to offload crypto 
> > operations
> > and process everything inline. For cases, where crypto operations cannot 
> > happen
> > inline (hard interrupt context, for example the read path of the NVME 
> > driver),
> > we offload the work to a tasklet rather than a workqueue.
>
> This patch both removes some dm-crypt specific queueing, and changes 
> decryption
> to use softIRQ context instead of a workqueue.  It would be useful to know how
> much of a difference the workqueue => softIRQ change makes by itself.  Such a
> change could be useful for fscrypt as well.  (fscrypt uses a workqueue for
> decryption, but besides that doesn't use any other queueing.)
>
> > @@ -127,7 +128,7 @@ struct iv_elephant_private {
> >   * and encrypts / decrypts at the same time.
> >   */
> >  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
> > -  DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
> > +  DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = 
> > (sizeof(unsigned long) * 8 - 1) };
>
> Assigning a specific enum value isn't necessary.

Yes, this is a leftover from our "internal" patch which I wanted to
make "future proof" in case future iterations of dm-crypt will add
some flags to avoid flag collisions. Will remove in v2.

>
> > @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct 
> > crypt_config *cc,
> >
> >   skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
> >
> > - /*
> > -  * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> > -  * requests if driver request queue is full.
> > -  */
> > - skcipher_request_set_callback(ctx->r.req,
> > - CRYPTO_TFM_REQ_MAY_BACKLOG,
> > - kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> > + if (test_bit(DM_CRYPT_FORCE_INLINE, >flags))
> > + /* make sure we zero important fields of the request */
> > + skcipher_request_set_callback(ctx->r.req,
> > + 0, NULL, NULL);
> > + else
> > + /*
> > +  * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> > +  * requests if driver request queue is full.
> > +  */
> > + skcipher_request_set_callback(ctx->r.req,
> > + CRYPTO_TFM_REQ_MAY_BACKLOG,
> > + kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> >  }
>
> This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
> crypto_alloc_skcipher(), the skcipher implementation can still be 
> asynchronous,
> in which case providing a callback is required.
>
> Do you intend that the "force_inline" option forces the use of a synchronous
> skcipher (alongside the other things it does)?  Or should it still allow
> asynchronous ones?

As mentioned above, I don't think we should require synchronous crypto
with the "force_inline" flag anymore. Although we may remove
CRYPTO_TFM_REQ_MAY_BACKLOG with the inline flag.

> We may not actually have a choice in that matter, since xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
> cases; thus, the crypto API won't give you it if you ask for a synchronous
> cipher.  So I think you still need to allow async skciphers?  That means a
> callback is still always required.
>
> - Eric


Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

2020-06-23 Thread Eric Biggers
On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This 
> is
> especially visible on busy systems with many processes/threads. Moreover, most
> Crypto API implementaions are async, that is they offload crypto operations on
> their own, so this dm-crypt offloading is excessive.

This really should say "some Crypto API implementations are async" instead of
"most Crypto API implementations are async".

Notably, the AES-NI implementation of AES-XTS is synchronous if you call it in a
context where SIMD instructions are usable.  It's only asynchronous when SIMD is
not usable.  (This seems to have been missed in your blog post.)

> This adds a new flag, which directs dm-crypt not to offload crypto operations
> and process everything inline. For cases, where crypto operations cannot 
> happen
> inline (hard interrupt context, for example the read path of the NVME driver),
> we offload the work to a tasklet rather than a workqueue.

This patch both removes some dm-crypt specific queueing, and changes decryption
to use softIRQ context instead of a workqueue.  It would be useful to know how
much of a difference the workqueue => softIRQ change makes by itself.  Such a
change could be useful for fscrypt as well.  (fscrypt uses a workqueue for
decryption, but besides that doesn't use any other queueing.)

> @@ -127,7 +128,7 @@ struct iv_elephant_private {
>   * and encrypts / decrypts at the same time.
>   */
>  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
> -  DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
> +  DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = 
> (sizeof(unsigned long) * 8 - 1) };

Assigning a specific enum value isn't necessary.

> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct 
> crypt_config *cc,
>  
>   skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>  
> - /*
> -  * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> -  * requests if driver request queue is full.
> -  */
> - skcipher_request_set_callback(ctx->r.req,
> - CRYPTO_TFM_REQ_MAY_BACKLOG,
> - kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> + if (test_bit(DM_CRYPT_FORCE_INLINE, >flags))
> + /* make sure we zero important fields of the request */
> + skcipher_request_set_callback(ctx->r.req,
> + 0, NULL, NULL);
> + else
> + /*
> +  * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> +  * requests if driver request queue is full.
> +  */
> + skcipher_request_set_callback(ctx->r.req,
> + CRYPTO_TFM_REQ_MAY_BACKLOG,
> + kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>  }

This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
in which case providing a callback is required.

Do you intend that the "force_inline" option forces the use of a synchronous
skcipher (alongside the other things it does)?  Or should it still allow
asynchronous ones?

We may not actually have a choice in that matter, since xts-aes-aesni has the
CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
cases; thus, the crypto API won't give you it if you ask for a synchronous
cipher.  So I think you still need to allow async skciphers?  That means a
callback is still always required.

- Eric