Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-23 Thread Damien Le Moal
On 2020/06/24 0:23, Mike Snitzer wrote:
> On Tue, Jun 23 2020 at 11:07am -0400,
> Ignat Korchagin  wrote:
> 
>> Do you think it may be better to break it into two flags: one for the
>> read path and one for the write path? So, depending on the needs and
>> workflow, these could be enabled independently?
> 
> If there is a need to split, then sure.  But I think Damien had a hard
> requirement that writes had to be inlined but that reads didn't _need_
> to be for his dm-zoned usecase.  Damien may not yet have assessed the
> performance implications of not having reads inlined as much as you
> have.

We did do performance testing :)
The results are mixed: the performance difference between inline and workqueue
processing depends on the workload (mostly IO size, IO queue depth and the
number of drives being used). In many cases, inlining everything really does
improve performance, as Ignat reported.

In our testing, we used hard drives and so focused mostly on throughput rather
than command latency. The added workqueue context-switch overhead and crypto
work latency are small compared to typical HDD IO times, and become significant
only when the backend storage has short IO times.

In the case of HDDs, especially for large IO sizes, inlining crypto work does
not shine, as it prevents efficient use of CPU resources. This is especially
true for reads on a large system with many drives connected to a single HBA:
the softirq-context decryption work does not lend itself well to being spread
onto CPUs other than the one receiving the HBA IRQ that signals command
completions. The test results clearly show much higher throughput using
dm-crypt as-is.

On the other hand, inlining crypto work significantly improves workloads of
small random IOs, even for a large number of disks: removing the context
switch overhead speeds up completions, which in turn lets more requests be
sent to the drives more quickly, keeping them busy.
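
As a rough illustration, a small-random-IO comparison of that kind can be run
with fio against a dm-crypt mapping, e.g. (parameters and device name are
illustrative only, not our exact test configuration):

# Illustrative small-random-read job; /dev/mapper/encrypted-disk is a
# placeholder for the dm-crypt mapping under test.
fio --name=randread --filename=/dev/mapper/encrypted-disk \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --iodepth=32 --numjobs=4 --runtime=60 --time_based \
    --group_reporting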

For SMR, inlining write requests is *mandatory* to preserve the issuer write
sequence. But since the encryption work is then done in the issuer context
(writes to SMR drives can only be O_DIRECT writes), efficient CPU usage can
still be achieved by simply using multiple writer threads/processes working on
different zones of different disks, as sketched below. This is a very
reasonable model for SMR, as writes into a single zone have to be done under
mutual exclusion to ensure sequentiality.
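
A minimal sketch of that model, with one sequential O_DIRECT writer per zone
(the device name and the 256 MiB zone size are made-up examples):

# One sequential O_DIRECT writer per zone of /dev/mapper/encrypted-smr
# (hypothetical mapping; 256 MiB zones assumed). With bs=1M, seek=
# counts 1 MiB blocks, so writer $i starts exactly at the beginning of
# zone $i and fills that zone sequentially.
for i in 0 1 2 3; do
    dd if=/dev/zero of=/dev/mapper/encrypted-smr bs=1M count=256 \
       seek=$(( i * 256 )) oflag=direct &
done
wait

Each writer owns its zone, so the per-zone mutual exclusion is implicit and
all CPUs can be kept busy doing the encryption work inline.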

For reads, SMR drives behave exactly like regular disks, so both as-is and
inline processing are fine. Based on our performance results, letting the user
choose whether reads are inlined, based on the target workload, would be
great.

Of note is that zone append writes (emulated in SCSI, native in NVMe) are not
subject to the sequential write constraint, so they can be executed either
inline or asynchronously.

> So let's see how Damien's work goes and if he truly doesn't need/want
> reads to be inlined then 2 flags can be created.

For SMR, I do not need inline reads, but I do want the user to have the option
of using this setup, as it can provide better performance for some workloads.
I think that splitting the inline flag in two is exactly what we want:

1) For SMR, the write-inline flag can be turned on automatically when the
target device is created, if the backend device is a host-managed zoned drive
(SCSI or NVMe ZNS). For reads, it would be the user's choice, based on the
target workload.
2) For regular block devices, write-inline only, read-inline only or both would
be the user's choice, to optimize for the target workload.

With the split into two flags, my SMR support patch becomes very simple.
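
For illustration, the table lines could then look like this (the flag names
below are hypothetical placeholders, since nothing is settled yet, and <key>
stands for an actual key):

# SMR backend: writes inlined (mandatory); reads processed as-is.
echo '0 8388608 crypt aes-xts-plain64 <key> 0 /dev/sdb 0 1 inline_write' |
    dmsetup create encrypted-smr

# Regular fast SSD: inline both reads and writes.
echo '0 8388608 crypt aes-xts-plain64 <key> 0 /dev/nvme0n1 0 2 inline_read inline_write' |
    dmsetup create encrypted-ssd

With the automatic detection of host-managed devices described in 1) above,
the SMR case would not even need the explicit write flag in the table.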

> 
> Mike
> 


-- 
Damien Le Moal
Western Digital Research


Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-21 Thread Damien Le Moal
On 2020/06/20 1:56, Mike Snitzer wrote:
> On Fri, Jun 19 2020 at 12:41pm -0400,
> Ignat Korchagin  wrote:
> 
>> This is a follow-up to the long-forgotten [1], but with some more
>> convincing evidence. Consider the following script:
>>
>> #!/bin/bash -e
>>
>> # create 4G ramdisk
>> sudo modprobe brd rd_nr=1 rd_size=4194304
>>
>> # create a dm-crypt device with NULL cipher on top of /dev/ram0
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' |
>>     sudo dmsetup create eram0
>>
>> # create a dm-crypt device with NULL cipher and custom force_inline flag
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' |
>>     sudo dmsetup create inline-eram0
>>
>> # read all data from /dev/ram0
>> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
>>
>> # read the same data from /dev/mapper/eram0
>> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
>>
>> # read the same data from /dev/mapper/inline-eram0
>> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
>>
>> This script creates a ramdisk (to eliminate hardware bias in the benchmark)
>> and two dm-crypt instances on top. Both dm-crypt instances use the NULL
>> cipher to eliminate potentially expensive crypto bias (the NULL cipher just
>> uses memcpy for "encryption"). The first instance is the current dm-crypt
>> implementation from 5.8-rc1, the second is the dm-crypt instance with the
>> custom new flag enabled from the patch attached to this thread. On my VM
>> (Debian in VirtualBox with 4 cores on a 2.8 GHz Quad-Core Intel Core i7) I
>> get the following output (formatted for better readability):
>>
>> # plain ram0
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>
>> # eram0 (current dm-crypt)
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>
>> # inline-eram0 (patched dm-crypt)
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>
>> As we can see, the current dm-crypt implementation creates significant IO
>> performance overhead (at least for small IO block sizes) for both latency
>> and throughput. We suspect offloading IO request processing into workqueues
>> and async threads is more harmful these days with modern fast storage. I
>> also did some digging into the dm-crypt git history and much of this async
>> processing is not needed anymore, because the reasons it was added are
>> mostly gone from the kernel. More details can be found in [2] (see the
>> "Git archeology" section).
>>
>> We have been running the attached patch on different hardware generations
>> in more than 200 datacentres, on both SATA and NVMe SSDs, and have so far
>> been very happy with the performance benefits.
>>
>> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
>> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
>>
>> Ignat Korchagin (1):
>>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
>>
>>  drivers/md/dm-crypt.c | 55 +--
>>  1 file changed, 43 insertions(+), 12 deletions(-)
>>
>> -- 
>> 2.20.1
>>
> 
> Hi,
> 
> I saw [2] and have been expecting something from Cloudflare ever since.
> Nice to see this submission.
> 
> There is useful context in your 0th patch header.  I'll likely merge
> parts of this patch header with the more terse 1/1 header (reality is
> there only needed to be a single patch submission).
> 
> Will review and stage accordingly if all looks fine to me.  Mikulas,
> please have a look too.

Very timely: I was about to send a couple of patches to add zoned block device
support to dm-crypt :)

I used the work in [1] as a base to have all _write_ requests processed inline
in the submitter context so that the submission order is preserved, avoiding
the potential reordering of sequential writes that the normal workqueue-based
processing can generate. This inline processing is done only for writes; reads
are unaffected.

To do this, I added an "inline_io" flag to struct convert_context, which is
initialized in crypt_io_init() based on the BIO op.
kcryptd_crypt_write_io_submit() then uses this flag to call
generic_make_request() directly if inline_io is true.

This simplifies things compared to [1] since reads can still be processed as
is, so there are no issues with IRQ context and no need for a tasklet.

Should I send these patches as an RFC to see what can be merged? Or I can wait
for your patches to land and rebase on top.

> 
> Thanks,
> Mike
> 


-- 
Damien Le Moal
Western Digital Research