On Wed, May 25, 2016 at 11:30:01PM +0900, Sergey Senozhatsky wrote:
> We don't have an idle zstreams list anymore and our write path
> now works absolutely differently, preventing preemption during
> compression. This removes possibilities of read paths preempting
> writes at wrong places (which could badly affect the performance
> of both paths) and at the same time opens the door for a move
> from custom LZO/LZ4 compression backends implementation to a more
> generic one, using crypto compress API.
> 
> Joonsoo Kim [1] attempted to do this a while ago, but faced with
> the need of introducing a new crypto API interface. The root cause
> was the fact that crypto API compression algorithms require a
> compression stream structure (in zram terminology) for both
> compression and decompression ops, while in reality only several
> of compression algorithms really need it. This resulted in a
> concept of context-less crypto API compression backends [2]. Both
> write and read paths, though, would have been executed with the
> preemption enabled, which in the worst case could have resulted
> in a decreased worst-case performance, e.g. consider the
> following case:
> 
>       CPU0
> 
>       zram_write()
>         spin_lock()
>           take the last idle stream
>         spin_unlock()
> 
>       << preempted >>
> 
>               zram_read()
>                 spin_lock()
>                  no idle streams
>                         spin_unlock()
>                         schedule()
> 
>       resuming zram_write compression()
> 
> but it took me some time to realize that, and it took even longer
> to evolve zram and to make it ready for crypto API. The key turned
> out to be -- drop the idle streams list entirely. With out the

                                                    Without

> idle streams list we are free to use compression algorithms that
> require compression stream for decompression (read), because streams
> are now placed in per-cpu data and each write path has to disable
> preemption for compression op, almost completely eliminating the
> aforementioned case (technically, we still have a small chance,
> because write path has a fast and a slow paths and the slow path
> is executed with the preemption enabled; but the frequency of
> failed fast path is too low).

Nice description.

> 
> TEST
> ====
> 
> - 4 CPUs, x86_64 system
> - 3G zram, lzo
> - fio tests: read, randread, write, randwrite, rw, randrw
> 
> test script [3] command:
>  ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh
> 
>                    BASE           PATCHED
> jobs1
> READ:           2527.2MB/s     2482.7MB/s
> READ:           2102.7MB/s     2045.0MB/s
> WRITE:          1284.3MB/s     1324.3MB/s
> WRITE:          1080.7MB/s     1101.9MB/s
> READ:           430125KB/s     437498KB/s
> WRITE:          430538KB/s     437919KB/s
> READ:           399593KB/s     403987KB/s
> WRITE:          399910KB/s     404308KB/s
> jobs2
> READ:           8133.5MB/s     7854.8MB/s
> READ:           7086.6MB/s     6912.8MB/s
> WRITE:          3177.2MB/s     3298.3MB/s
> WRITE:          2810.2MB/s     2871.4MB/s
> READ:           1017.6MB/s     1023.4MB/s
> WRITE:          1018.2MB/s     1023.1MB/s
> READ:           977836KB/s     984205KB/s
> WRITE:          979435KB/s     985814KB/s
> jobs3
> READ:           13557MB/s      13391MB/s
> READ:           11876MB/s      11752MB/s
> WRITE:          4641.5MB/s     4682.1MB/s
> WRITE:          4164.9MB/s     4179.3MB/s
> READ:           1453.8MB/s     1455.1MB/s
> WRITE:          1455.1MB/s     1458.2MB/s
> READ:           1387.7MB/s     1395.7MB/s
> WRITE:          1386.1MB/s     1394.9MB/s
> jobs4
> READ:           20271MB/s      20078MB/s
> READ:           18033MB/s      17928MB/s
> WRITE:          6176.8MB/s     6180.5MB/s
> WRITE:          5686.3MB/s     5705.3MB/s
> READ:           2009.4MB/s     2006.7MB/s
> WRITE:          2007.5MB/s     2004.9MB/s
> READ:           1929.7MB/s     1935.6MB/s
> WRITE:          1926.8MB/s     1932.6MB/s
> jobs5
> READ:           18823MB/s      19024MB/s
> READ:           18968MB/s      19071MB/s
> WRITE:          6191.6MB/s     6372.1MB/s
> WRITE:          5818.7MB/s     5787.1MB/s
> READ:           2011.7MB/s     1981.3MB/s
> WRITE:          2011.4MB/s     1980.1MB/s
> READ:           1949.3MB/s     1935.7MB/s
> WRITE:          1940.4MB/s     1926.1MB/s
> jobs6
> READ:           21870MB/s      21715MB/s
> READ:           19957MB/s      19879MB/s
> WRITE:          6528.4MB/s     6537.6MB/s
> WRITE:          6098.9MB/s     6073.6MB/s
> READ:           2048.6MB/s     2049.9MB/s
> WRITE:          2041.7MB/s     2042.9MB/s
> READ:           2013.4MB/s     1990.4MB/s
> WRITE:          2009.4MB/s     1986.5MB/s
> jobs7
> READ:           21359MB/s      21124MB/s
> READ:           19746MB/s      19293MB/s
> WRITE:          6660.4MB/s     6518.8MB/s
> WRITE:          6211.6MB/s     6193.1MB/s
> READ:           2089.7MB/s     2080.6MB/s
> WRITE:          2085.8MB/s     2076.5MB/s
> READ:           2041.2MB/s     2052.5MB/s
> WRITE:          2037.5MB/s     2048.8MB/s
> jobs8
> READ:           20477MB/s      19974MB/s
> READ:           18922MB/s      18576MB/s
> WRITE:          6851.9MB/s     6788.3MB/s
> WRITE:          6407.7MB/s     6347.5MB/s
> READ:           2134.8MB/s     2136.1MB/s
> WRITE:          2132.8MB/s     2134.4MB/s
> READ:           2074.2MB/s     2069.6MB/s
> WRITE:          2087.3MB/s     2082.4MB/s
> jobs9
> READ:           19797MB/s      19994MB/s
> READ:           18806MB/s      18581MB/s
> WRITE:          6878.7MB/s     6822.7MB/s
> WRITE:          6456.8MB/s     6447.2MB/s
> READ:           2141.1MB/s     2154.7MB/s
> WRITE:          2144.4MB/s     2157.3MB/s
> READ:           2084.1MB/s     2085.1MB/s
> WRITE:          2091.5MB/s     2092.5MB/s
> jobs10
> READ:           19794MB/s      19784MB/s
> READ:           18794MB/s      18745MB/s
> WRITE:          6984.4MB/s     6676.3MB/s
> WRITE:          6532.3MB/s     6342.7MB/s
> READ:           2150.6MB/s     2155.4MB/s
> WRITE:          2156.8MB/s     2161.5MB/s
> READ:           2106.4MB/s     2095.6MB/s
> WRITE:          2109.7MB/s     2098.4MB/s
> 
>                                     BASE                       PATCHED
> jobs1                              perfstat
> stalled-cycles-frontend     102,480,595,419 (  41.53%)          
> 114,508,864,804 (  46.92%)
> stalled-cycles-backend       51,941,417,832 (  21.05%)           
> 46,836,112,388 (  19.19%)
> instructions                283,612,054,215 (    1.15)          
> 283,918,134,959 (    1.16)
> branches                     56,372,560,385 ( 724.923)           
> 56,449,814,753 ( 733.766)
> branch-misses                   374,826,000 (   0.66%)              
> 326,935,859 (   0.58%)
> jobs2                              perfstat
> stalled-cycles-frontend     155,142,745,777 (  40.99%)          
> 164,170,979,198 (  43.82%)
> stalled-cycles-backend       70,813,866,387 (  18.71%)           
> 66,456,858,165 (  17.74%)
> instructions                463,436,648,173 (    1.22)          
> 464,221,890,191 (    1.24)
> branches                     91,088,733,902 ( 760.088)           
> 91,278,144,546 ( 769.133)
> branch-misses                   504,460,363 (   0.55%)              
> 394,033,842 (   0.43%)
> jobs3                              perfstat
> stalled-cycles-frontend     201,300,397,212 (  39.84%)          
> 223,969,902,257 (  44.44%)
> stalled-cycles-backend       87,712,593,974 (  17.36%)           
> 81,618,888,712 (  16.19%)
> instructions                642,869,545,023 (    1.27)          
> 644,677,354,132 (    1.28)
> branches                    125,724,560,594 ( 690.682)          
> 126,133,159,521 ( 694.542)
> branch-misses                   527,941,798 (   0.42%)              
> 444,782,220 (   0.35%)
> jobs4                              perfstat
> stalled-cycles-frontend     246,701,197,429 (  38.12%)          
> 280,076,030,886 (  43.29%)
> stalled-cycles-backend      119,050,341,112 (  18.40%)          
> 110,955,641,671 (  17.15%)
> instructions                822,716,962,127 (    1.27)          
> 825,536,969,320 (    1.28)
> branches                    160,590,028,545 ( 688.614)          
> 161,152,996,915 ( 691.068)
> branch-misses                   650,295,287 (   0.40%)              
> 550,229,113 (   0.34%)
> jobs5                              perfstat
> stalled-cycles-frontend     298,958,462,516 (  38.30%)          
> 344,852,200,358 (  44.16%)
> stalled-cycles-backend      137,558,742,122 (  17.62%)          
> 129,465,067,102 (  16.58%)
> instructions              1,005,714,688,752 (    1.29)        
> 1,007,657,999,432 (    1.29)
> branches                    195,988,773,962 ( 697.730)          
> 196,446,873,984 ( 700.319)
> branch-misses                   695,818,940 (   0.36%)              
> 624,823,263 (   0.32%)
> jobs6                              perfstat
> stalled-cycles-frontend     334,497,602,856 (  36.71%)          
> 387,590,419,779 (  42.38%)
> stalled-cycles-backend      163,539,365,335 (  17.95%)          
> 152,640,193,639 (  16.69%)
> instructions              1,184,738,177,851 (    1.30)        
> 1,187,396,281,677 (    1.30)
> branches                    230,592,915,640 ( 702.902)          
> 231,253,802,882 ( 702.356)
> branch-misses                   747,934,786 (   0.32%)              
> 643,902,424 (   0.28%)
> jobs7                              perfstat
> stalled-cycles-frontend     396,724,684,187 (  37.71%)          
> 460,705,858,952 (  43.84%)
> stalled-cycles-backend      188,096,616,496 (  17.88%)          
> 175,785,787,036 (  16.73%)
> instructions              1,364,041,136,608 (    1.30)        
> 1,366,689,075,112 (    1.30)
> branches                    265,253,096,936 ( 700.078)          
> 265,890,524,883 ( 702.839)
> branch-misses                   784,991,589 (   0.30%)              
> 729,196,689 (   0.27%)
> jobs8                              perfstat
> stalled-cycles-frontend     440,248,299,870 (  36.92%)          
> 509,554,793,816 (  42.46%)
> stalled-cycles-backend      222,575,930,616 (  18.67%)          
> 213,401,248,432 (  17.78%)
> instructions              1,542,262,045,114 (    1.29)        
> 1,545,233,932,257 (    1.29)
> branches                    299,775,178,439 ( 697.666)          
> 300,528,458,505 ( 694.769)
> branch-misses                   847,496,084 (   0.28%)              
> 748,794,308 (   0.25%)
> jobs9                              perfstat
> stalled-cycles-frontend     506,269,882,480 (  37.86%)          
> 592,798,032,820 (  44.43%)
> stalled-cycles-backend      253,192,498,861 (  18.93%)          
> 233,727,666,185 (  17.52%)
> instructions              1,721,985,080,913 (    1.29)        
> 1,724,666,236,005 (    1.29)
> branches                    334,517,360,255 ( 694.134)          
> 335,199,758,164 ( 697.131)
> branch-misses                   873,496,730 (   0.26%)              
> 815,379,236 (   0.24%)
> jobs10                             perfstat
> stalled-cycles-frontend     549,063,363,749 (  37.18%)          
> 651,302,376,662 (  43.61%)
> stalled-cycles-backend      281,680,986,810 (  19.07%)          
> 277,005,235,582 (  18.55%)
> instructions              1,901,859,271,180 (    1.29)        
> 1,906,311,064,230 (    1.28)
> branches                    369,398,536,153 ( 694.004)          
> 370,527,696,358 ( 688.409)
> branch-misses                   967,929,335 (   0.26%)              
> 890,125,056 (   0.24%)
> 
>                             BASE           PATCHED
> seconds elapsed        79.421641008   78.735285546
> seconds elapsed        61.471246133   60.869085949
> seconds elapsed        62.317058173   62.224188495
> seconds elapsed        60.030739363   60.081102518
> seconds elapsed        74.070398362   74.317582865
> seconds elapsed        84.985953007   85.414364176
> seconds elapsed        97.724553255   98.173311344
> seconds elapsed        109.488066758  110.268399318
> seconds elapsed        122.768189405  122.967164498
> seconds elapsed        135.130035105  136.934770801
> 
> On my other system (8 x86_64 CPUs, short version of test results):
> 
>                             BASE           PATCHED
> seconds elapsed        19.518065994   19.806320662
> seconds elapsed        15.172772749   15.594718291
> seconds elapsed        13.820925970   13.821708564
> seconds elapsed        13.293097816   14.585206405
> seconds elapsed        16.207284118   16.064431606
> seconds elapsed        17.958376158   17.771825767
> seconds elapsed        19.478009164   19.602961508
> seconds elapsed        21.347152811   21.352318709
> seconds elapsed        24.478121126   24.171088735
> seconds elapsed        26.865057442   26.767327618
> 
> So performance-wise the numbers are quite similar.
> 
> [1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2
> [2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2
> [3] https://github.com/sergey-senozhatsky/zram-perf-test
> 
> Signed-off-by: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
> Suggested-by: Minchan Kim <minc...@kernel.org>
> Suggested-by: Joonsoo Kim <iamjoonsoo....@lge.com>
> ---
>  drivers/block/zram/Kconfig    | 10 +++----
>  drivers/block/zram/zcomp.c    | 66 
> ++++++++++++++++++++++++++++---------------
>  drivers/block/zram/zcomp.h    |  7 +++--
>  drivers/block/zram/zram_drv.c | 14 +++++----
>  4 files changed, 61 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
> index 386ba3d..2252cd7 100644
> --- a/drivers/block/zram/Kconfig
> +++ b/drivers/block/zram/Kconfig
> @@ -1,8 +1,7 @@
>  config ZRAM
>       tristate "Compressed RAM block device support"
> -     depends on BLOCK && SYSFS && ZSMALLOC
> -     select LZO_COMPRESS
> -     select LZO_DECOMPRESS
> +     depends on BLOCK && SYSFS && ZSMALLOC && CRYPTO
> +     select CRYPTO_LZO
>       default n
>       help
>         Creates virtual block devices called /dev/zramX (X = 0, 1, ...).
> @@ -18,9 +17,8 @@ config ZRAM
>  config ZRAM_LZ4_COMPRESS
>       bool "Enable LZ4 algorithm support"
>       depends on ZRAM
> -     select LZ4_COMPRESS
> -     select LZ4_DECOMPRESS
> +     select CRYPTO_LZ4
>       default n
>       help
>         This option enables LZ4 compression algorithm support. Compression
> -       algorithm can be changed using `comp_algorithm' device attribute.
> \ No newline at end of file
> +       algorithm can be changed using `comp_algorithm' device attribute.
> diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
> index 400f826..82b568e 100644
> --- a/drivers/block/zram/zcomp.c
> +++ b/drivers/block/zram/zcomp.c
> @@ -14,26 +14,23 @@
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/cpu.h>
> +#include <linux/crypto.h>
>  
>  #include "zcomp.h"
> -#include "zcomp_lzo.h"
> -#ifdef CONFIG_ZRAM_LZ4_COMPRESS
> -#include "zcomp_lz4.h"
> -#endif
>  
> -static struct zcomp_backend *backends[] = {
> -     &zcomp_lzo,
> +static const char * const backends[] = {
> +     "lzo",
>  #ifdef CONFIG_ZRAM_LZ4_COMPRESS
> -     &zcomp_lz4,
> +     "lz4",
>  #endif
>       NULL
>  };
>  
> -static struct zcomp_backend *find_backend(const char *compress)
> +static const char *find_backend(const char *compress)
>  {
>       int i = 0;
>       while (backends[i]) {
> -             if (sysfs_streq(compress, backends[i]->name))
> +             if (sysfs_streq(compress, backends[i]))
>                       break;
>               i++;
>       }
> @@ -42,8 +39,8 @@ static struct zcomp_backend *find_backend(const char 
> *compress)
>  
>  static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm)
>  {
> -     if (zstrm->private)
> -             comp->backend->destroy(zstrm->private);
> +     if (!IS_ERR_OR_NULL(zstrm->private))

Let's change private with tfm.


> +             crypto_free_comp(zstrm->private);
>       free_pages((unsigned long)zstrm->buffer, 1);
>       kfree(zstrm);
>  }
> @@ -58,13 +55,13 @@ static struct zcomp_strm *zcomp_strm_alloc(struct zcomp 
> *comp, gfp_t flags)
>       if (!zstrm)
>               return NULL;
>  
> -     zstrm->private = comp->backend->create(flags);
> +     zstrm->private = crypto_alloc_comp(comp->name, 0, 0);

crypto_alloc_comp uses GPF_KERNEL for allocating tfm and zram uses
GFP_KERNEL for zcomp_strm_alloc now so there is no point to pass
gfp_t so let's clean it up.

Otherwise, looks good to me at af first glance.

 

Reply via email to