ping!

On Tue, Feb 18, 2014 at 06:13:04PM +0800, Shaohua Li wrote:
> 
> This is a simple DM target supporting compression for SSDs only. The
> underlying SSD must support a 512B sector size; the target itself only
> supports a 4k sector size.
> 
> Disk layout:
> |super|...meta...|..data...|
> 
> The storage unit is 4k (a block). The super is 1 block, which stores the
> meta and data sizes and the compression algorithm. Meta is a bitmap with
> 5 bits of metadata per data block.
> 
> Data:
> The data of a block is compressed. The compressed data is rounded up to
> 512B, which is the payload. On disk, the payload is stored at the
> beginning of the block's logical sector range. Let's look at an example.
> Say we store data to block A, which is at sector B (A*8); its original
> size is 4k and its compressed size is 1500 bytes. The compressed data
> (CD) will use 3 sectors (512B each). These 3 sectors are the payload,
> stored starting at sector B.
> 
> ---------------------------------------------------
> ... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
> ---------------------------------------------------
>     ^B    ^B+1  ^B+2                  ^B+7 ^B+8
> 
> For this block, we will not use sectors B+3 to B+7 (a hole). We use 4
> meta bits to represent the payload size. The compressed size (1500) isn't
> stored in the meta directly. Instead, we store it in the last 32 bits of
> the payload; in this example, at the end of sector B+2. If the compressed
> size plus sizeof(u32) crosses a sector boundary, the payload grows by one
> sector. If the payload would use all 8 sectors, we store the uncompressed
> data directly.
> 
> If the IO size is bigger than one block, we can store the data as an
> extent. The data of the whole extent is compressed and stored in the same
> way as above. The first block of the extent is the head; all others are
> tails. If the extent is 1 block, that block is the head. We have 1 meta
> bit to indicate whether a block is a head or a tail. If the 4 meta bits
> of the head block can't hold the extent's payload size, we borrow the
> tail blocks' meta bits to store it. The maximum allowed extent size is
> 128k, so we never compress/decompress overly large chunks of data.
> 
> Meta:
> Modifying data modifies meta too. Meta is written (flushed) to disk
> according to the meta write policy. We support writeback and writethrough
> modes. In writeback mode, meta is written to disk periodically or on a
> FLUSH request. In writethrough mode, data and metadata are written to
> disk together.
> 
> Advantages:
> 1. Simple. Since we store compressed data in-place, we don't need
> complicated disk data management.
> 2. Efficient. For each 4k block, we only need 5 bits of meta. 1T of data
> uses less than 200M of meta, so we can load all meta into memory. And the
> actual compressed size lives in the payload, so if an IO doesn't need a
> read-modify-write and we use writeback meta flushing, we don't need extra
> IO for meta.
> 
> Disadvantages:
> 1. Holes. Since we store compressed data in-place, there are a lot of
> holes (in the above example, B+3 to B+7). Holes can hurt IO, because we
> can't merge IO across them.
> 2. 1:1 size. Compression doesn't change the disk size. If the disk is 1T,
> we can only store 1T of data even with compression.
> 
> But this target is for SSDs only. Generally SSD firmware has an FTL layer
> to map disk sectors to NAND flash; high-end SSD firmware has a
> filesystem-like FTL.
> 1. Holes. The disk has a lot of holes, but the SSD FTL can still store
> the data contiguously in NAND. Even though we can't merge IO at the OS
> layer, the SSD firmware can.
> 2. 1:1 size. On one hand, we write compressed data to the SSD, which
> means less data is written to the flash. This helps SSD garbage
> collection considerably, and hence write speed and endurance, so the
> target is useful even with this limitation. On the other hand, an
> advanced SSD FTL can easily do thin provisioning: if the NAND is 1T and
> the SSD reports itself as 2T, using that SSD as a compressed target
> avoids the 1:1 size issue entirely.
> 
> So if the SSD FTL can map non-contiguous disk sectors to contiguous NAND
> and supports thin provisioning, the compressed target will work very well.
> 
> V2->V3:
> Updated with new bio iter API
> 
> V1->V2:
> 1. Change name to insitu_comp, cleanup code, add comments and doc
> 2. Improve performance (extent locking, dedicated workqueue)
> 
> Signed-off-by: Shaohua Li <s...@fusionio.com>
> ---
>  Documentation/device-mapper/insitu-comp.txt |   50 
>  drivers/md/Kconfig                          |    6 
>  drivers/md/Makefile                         |    1 
>  drivers/md/dm-insitu-comp.c                 | 1480 ++++++++++++++++++++++++++++
>  drivers/md/dm-insitu-comp.h                 |  158 ++
>  5 files changed, 1695 insertions(+)
> 
> Index: linux/drivers/md/Kconfig
> ===================================================================
> --- linux.orig/drivers/md/Kconfig     2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Kconfig  2014-02-17 17:34:45.423464815 +0800
> @@ -295,6 +295,12 @@ config DM_CACHE_CLEANER
>           A simple cache policy that writes back all data to the
>           origin.  Used when decommissioning a dm-cache.
>  
> +config DM_INSITU_COMPRESSION
> +       tristate "Insitu compression target"
> +       depends on BLK_DEV_DM
> +       ---help---
> +         Allow volume managers to compress data in place on SSDs.
> +
>  config DM_MIRROR
>         tristate "Mirror target"
>         depends on BLK_DEV_DM
> Index: linux/drivers/md/Makefile
> ===================================================================
> --- linux.orig/drivers/md/Makefile    2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Makefile 2014-02-17 17:34:45.423464815 +0800
> @@ -53,6 +53,7 @@ obj-$(CONFIG_DM_VERITY)             += dm-verity.o
>  obj-$(CONFIG_DM_CACHE)               += dm-cache.o
>  obj-$(CONFIG_DM_CACHE_MQ)    += dm-cache-mq.o
>  obj-$(CONFIG_DM_CACHE_CLEANER)       += dm-cache-cleaner.o
> +obj-$(CONFIG_DM_INSITU_COMPRESSION)          += dm-insitu-comp.o
>  
>  ifeq ($(CONFIG_DM_UEVENT),y)
>  dm-mod-objs                  += dm-uevent.o
> Index: linux/drivers/md/dm-insitu-comp.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.c 2014-02-17 20:16:38.093360018 +0800
> @@ -0,0 +1,1480 @@
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/slab.h>
> +#include <linux/device-mapper.h>
> +#include <linux/dm-io.h>
> +#include <linux/crypto.h>
> +#include <linux/lzo.h>
> +#include <linux/kthread.h>
> +#include <linux/page-flags.h>
> +#include <linux/completion.h>
> +#include "dm-insitu-comp.h"
> +
> +#define DM_MSG_PREFIX "dm_insitu_comp"
> +
> +static struct insitu_comp_compressor_data compressors[] = {
> +     [INSITU_COMP_ALG_LZO] = {
> +             .name = "lzo",
> +             .comp_len = lzo_comp_len,
> +     },
> +     [INSITU_COMP_ALG_ZLIB] = {
> +             .name = "deflate",
> +     },
> +};
> +static int default_compressor;
> +
> +static struct kmem_cache *insitu_comp_io_range_cachep;
> +static struct kmem_cache *insitu_comp_meta_io_cachep;
> +
> +static struct insitu_comp_io_worker insitu_comp_io_workers[NR_CPUS];
> +static struct workqueue_struct *insitu_comp_wq;
> +
> +/* each block has 5 bits metadata */
> +static u8 insitu_comp_get_meta(struct insitu_comp_info *info, u64 block_index)
> +{
> +     u64 first_bit = block_index * INSITU_COMP_META_BITS;
> +     int bits, offset;
> +     u8 data, ret = 0;
> +
> +     offset = first_bit & 7;
> +     bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> +     data = info->meta_bitmap[first_bit >> 3];
> +     ret = (data >> offset) & ((1 << bits) - 1);
> +
> +     if (bits < INSITU_COMP_META_BITS) {
> +             data = info->meta_bitmap[(first_bit >> 3) + 1];
> +             bits = INSITU_COMP_META_BITS - bits;
> +             ret |= (data & ((1 << bits) - 1)) <<
> +                     (INSITU_COMP_META_BITS - bits);
> +     }
> +     return ret;
> +}
> +
> +static void insitu_comp_set_meta(struct insitu_comp_info *info,
> +     u64 block_index, u8 meta, bool dirty_meta)
> +{
> +     u64 first_bit = block_index * INSITU_COMP_META_BITS;
> +     int bits, offset;
> +     u8 data;
> +     struct page *page;
> +
> +     offset = first_bit & 7;
> +     bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> +     data = info->meta_bitmap[first_bit >> 3];
> +     data &= ~(((1 << bits) - 1) << offset);
> +     data |= (meta & ((1 << bits) - 1)) << offset;
> +     info->meta_bitmap[first_bit >> 3] = data;
> +
> +     /*
> +      * For writethrough, we write metadata directly. For writeback, if
> +      * the request is FUA we do this too; otherwise we just dirty the
> +      * page, which will be flushed out periodically.
> +      */
> +     if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +             page = vmalloc_to_page(&info->meta_bitmap[first_bit >> 3]);
> +             if (dirty_meta)
> +                     SetPageDirty(page);
> +             else
> +                     ClearPageDirty(page);
> +     }
> +
> +     if (bits < INSITU_COMP_META_BITS) {
> +             meta >>= bits;
> +             data = info->meta_bitmap[(first_bit >> 3) + 1];
> +             bits = INSITU_COMP_META_BITS - bits;
> +             data = (data >> bits) << bits;
> +             data |= meta & ((1 << bits) - 1);
> +             info->meta_bitmap[(first_bit >> 3) + 1] = data;
> +
> +             if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +                     page = vmalloc_to_page(&info->meta_bitmap[
> +                                             (first_bit >> 3) + 1]);
> +                     if (dirty_meta)
> +                             SetPageDirty(page);
> +                     else
> +                             ClearPageDirty(page);
> +             }
> +     }
> +}
> +
> +/*
> + * Set metadata for an extent starting at block @block_index that is
> + * @logical_blocks long.  The extent uses @data_sectors data sectors.
> + */
> +static void insitu_comp_set_extent(struct insitu_comp_req *req,
> +     u64 block_index, u16 logical_blocks, sector_t data_sectors)
> +{
> +     int i;
> +     u8 data;
> +
> +     for (i = 0; i < logical_blocks; i++) {
> +             data = min_t(sector_t, data_sectors, 8);
> +             data_sectors -= data;
> +             if (i != 0)
> +                     data |= INSITU_COMP_TAIL_MASK;
> +             /* For FUA, we write out meta data directly */
> +             insitu_comp_set_meta(req->info, block_index + i, data,
> +                                     !(insitu_req_rw(req) & REQ_FUA));
> +     }
> +}
> +
> +/*
> + * get metadata for an extent covering block @block_index. @first_block_index
> + * returns the first block of the extent. @logical_sectors returns the extent
> + * length. @data_sectors returns the sectors the extent uses
> + */
> +static void insitu_comp_get_extent(struct insitu_comp_info *info,
> +     u64 block_index, u64 *first_block_index, u16 *logical_sectors,
> +     u16 *data_sectors)
> +{
> +     u8 data;
> +
> +     data = insitu_comp_get_meta(info, block_index);
> +     while (data & INSITU_COMP_TAIL_MASK) {
> +             block_index--;
> +             data = insitu_comp_get_meta(info, block_index);
> +     }
> +     *first_block_index = block_index;
> +     *logical_sectors = INSITU_COMP_BLOCK_SIZE >> 9;
> +     *data_sectors = data & INSITU_COMP_LENGTH_MASK;
> +     block_index++;
> +     while (block_index < info->data_blocks) {
> +             data = insitu_comp_get_meta(info, block_index);
> +             if (!(data & INSITU_COMP_TAIL_MASK))
> +                     break;
> +             *logical_sectors += INSITU_COMP_BLOCK_SIZE >> 9;
> +             *data_sectors += data & INSITU_COMP_LENGTH_MASK;
> +             block_index++;
> +     }
> +}
> +
> +static int insitu_comp_access_super(struct insitu_comp_info *info,
> +     void *addr, int rw)
> +{
> +     struct dm_io_region region;
> +     struct dm_io_request req;
> +     unsigned long io_error = 0;
> +     int ret;
> +
> +     region.bdev = info->dev->bdev;
> +     region.sector = 0;
> +     region.count = INSITU_COMP_BLOCK_SIZE >> 9;
> +
> +     req.bi_rw = rw;
> +     req.mem.type = DM_IO_KMEM;
> +     req.mem.offset = 0;
> +     req.mem.ptr.addr = addr;
> +     req.notify.fn = NULL;
> +     req.client = info->io_client;
> +
> +     ret = dm_io(&req, 1, &region, &io_error);
> +     if (ret || io_error)
> +             return -EIO;
> +     return 0;
> +}
> +
> +static void insitu_comp_meta_io_done(unsigned long error, void *context)
> +{
> +     struct insitu_comp_meta_io *meta_io = context;
> +
> +     meta_io->fn(meta_io->data, error);
> +     kmem_cache_free(insitu_comp_meta_io_cachep, meta_io);
> +}
> +
> +static int insitu_comp_write_meta(struct insitu_comp_info *info,
> +     u64 start_page, u64 end_page, void *data,
> +     void (*fn)(void *data, unsigned long error), int rw)
> +{
> +     struct insitu_comp_meta_io *meta_io;
> +
> +     BUG_ON(end_page > info->meta_bitmap_pages);
> +
> +     meta_io = kmem_cache_alloc(insitu_comp_meta_io_cachep, GFP_NOIO);
> +     if (!meta_io) {
> +             fn(data, -ENOMEM);
> +             return -ENOMEM;
> +     }
> +     meta_io->data = data;
> +     meta_io->fn = fn;
> +
> +     meta_io->io_region.bdev = info->dev->bdev;
> +     meta_io->io_region.sector = INSITU_COMP_META_START_SECTOR +
> +                                     (start_page << (PAGE_SHIFT - 9));
> +     meta_io->io_region.count = (end_page - start_page) << (PAGE_SHIFT - 9);
> +
> +     atomic64_add(meta_io->io_region.count << 9, &info->meta_write_size);
> +
> +     meta_io->io_req.bi_rw = rw;
> +     meta_io->io_req.mem.type = DM_IO_VMA;
> +     meta_io->io_req.mem.offset = 0;
> +     meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
> +                                             (start_page << PAGE_SHIFT);
> +     meta_io->io_req.notify.fn = insitu_comp_meta_io_done;
> +     meta_io->io_req.notify.context = meta_io;
> +     meta_io->io_req.client = info->io_client;
> +
> +     dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
> +     return 0;
> +}
> +
> +struct writeback_flush_data {
> +     struct completion complete;
> +     atomic_t cnt;
> +};
> +
> +static void writeback_flush_io_done(void *data, unsigned long error)
> +{
> +     struct writeback_flush_data *wb = data;
> +
> +     if (atomic_dec_return(&wb->cnt))
> +             return;
> +     complete(&wb->complete);
> +}
> +
> +static void insitu_comp_flush_dirty_meta(struct insitu_comp_info *info,
> +                     struct writeback_flush_data *data)
> +{
> +     struct page *page;
> +     u64 start = 0, index;
> +     u32 pending = 0, cnt = 0;
> +     bool dirty;
> +     struct blk_plug plug;
> +
> +     blk_start_plug(&plug);
> +     for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
> +             if (cnt == 256) {
> +                     cnt = 0;
> +                     cond_resched();
> +             }
> +
> +             page = vmalloc_to_page(info->meta_bitmap +
> +                                     (index << PAGE_SHIFT));
> +             dirty = TestClearPageDirty(page);
> +
> +             if (pending == 0 && dirty) {
> +                     start = index;
> +                     pending++;
> +                     continue;
> +             } else if (pending == 0)
> +                     continue;
> +             else if (pending > 0 && dirty) {
> +                     pending++;
> +                     continue;
> +             }
> +
> +             /* pending > 0 && !dirty */
> +             atomic_inc(&data->cnt);
> +             insitu_comp_write_meta(info, start, start + pending, data,
> +                     writeback_flush_io_done, WRITE);
> +             pending = 0;
> +     }
> +
> +     if (pending > 0) {
> +             atomic_inc(&data->cnt);
> +             insitu_comp_write_meta(info, start, start + pending, data,
> +                     writeback_flush_io_done, WRITE);
> +     }
> +     blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
> +     blk_finish_plug(&plug);
> +}
> +
> +/* The writeback thread flushes all dirty metadata to disk periodically */
> +static int insitu_comp_meta_writeback_thread(void *data)
> +{
> +     struct insitu_comp_info *info = data;
> +     struct writeback_flush_data wb;
> +
> +     atomic_set(&wb.cnt, 1);
> +     init_completion(&wb.complete);
> +
> +     while (!kthread_should_stop()) {
> +             schedule_timeout_interruptible(
> +                     msecs_to_jiffies(info->writeback_delay * 1000));
> +             insitu_comp_flush_dirty_meta(info, &wb);
> +     }
> +
> +     insitu_comp_flush_dirty_meta(info, &wb);
> +
> +     writeback_flush_io_done(&wb, 0);
> +     wait_for_completion(&wb.complete);
> +     return 0;
> +}
> +
> +static int insitu_comp_init_meta(struct insitu_comp_info *info, bool new)
> +{
> +     struct dm_io_region region;
> +     struct dm_io_request req;
> +     unsigned long io_error = 0;
> +     struct blk_plug plug;
> +     int ret;
> +     ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> +
> +     len *= sizeof(unsigned long);
> +
> +     region.bdev = info->dev->bdev;
> +     region.sector = INSITU_COMP_META_START_SECTOR;
> +     region.count = (len + 511) >> 9;
> +
> +     req.mem.type = DM_IO_VMA;
> +     req.mem.offset = 0;
> +     req.mem.ptr.addr = info->meta_bitmap;
> +     req.notify.fn = NULL;
> +     req.client = info->io_client;
> +
> +     blk_start_plug(&plug);
> +     if (new) {
> +             memset(info->meta_bitmap, 0, len);
> +             req.bi_rw = WRITE_FLUSH;
> +             ret = dm_io(&req, 1, &region, &io_error);
> +     } else {
> +             req.bi_rw = READ;
> +             ret = dm_io(&req, 1, &region, &io_error);
> +     }
> +     blk_finish_plug(&plug);
> +
> +     if (ret || io_error) {
> +             info->ti->error = "Access metadata error";
> +             return -EIO;
> +     }
> +
> +     if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> +             info->writeback_tsk = kthread_run(
> +                     insitu_comp_meta_writeback_thread,
> +                     info, "insitu_comp_writeback");
> +             if (IS_ERR(info->writeback_tsk)) {
> +                     info->ti->error = "Create writeback thread error";
> +                     return PTR_ERR(info->writeback_tsk);
> +             }
> +     }
> +
> +     return 0;
> +}
> +
> +static int insitu_comp_alloc_compressor(struct insitu_comp_info *info)
> +{
> +     int i;
> +
> +     for_each_possible_cpu(i) {
> +             info->tfm[i] = crypto_alloc_comp(
> +                     compressors[info->comp_alg].name, 0, 0);
> +             if (IS_ERR(info->tfm[i])) {
> +                     info->tfm[i] = NULL;
> +                     goto err;
> +             }
> +     }
> +     return 0;
> +err:
> +     for_each_possible_cpu(i) {
> +             if (info->tfm[i]) {
> +                     crypto_free_comp(info->tfm[i]);
> +                     info->tfm[i] = NULL;
> +             }
> +     }
> +     return -ENOMEM;
> +}
> +
> +static void insitu_comp_free_compressor(struct insitu_comp_info *info)
> +{
> +     int i;
> +
> +     for_each_possible_cpu(i) {
> +             if (info->tfm[i]) {
> +                     crypto_free_comp(info->tfm[i]);
> +                     info->tfm[i] = NULL;
> +             }
> +     }
> +}
> +
> +static int insitu_comp_read_or_create_super(struct insitu_comp_info *info)
> +{
> +     void *addr;
> +     struct insitu_comp_super_block *super;
> +     u64 total_blocks;
> +     u64 data_blocks, meta_blocks;
> +     u32 rem, cnt;
> +     bool new_super = false;
> +     int ret;
> +     ssize_t len;
> +
> +     total_blocks = i_size_read(info->dev->bdev->bd_inode) >>
> +                                     INSITU_COMP_BLOCK_SHIFT;
> +     data_blocks = total_blocks - 1;
> +     rem = do_div(data_blocks, INSITU_COMP_BLOCK_SIZE * 8 +
> +                     INSITU_COMP_META_BITS);
> +     meta_blocks = data_blocks * INSITU_COMP_META_BITS;
> +     data_blocks *= INSITU_COMP_BLOCK_SIZE * 8;
> +
> +     cnt = rem;
> +     rem /= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> +     data_blocks += rem * (INSITU_COMP_BLOCK_SIZE * 8 /
> +                             INSITU_COMP_META_BITS);
> +     meta_blocks += rem;
> +
> +     cnt %= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> +     meta_blocks += 1;
> +     data_blocks += cnt - 1;
> +
> +     info->data_blocks = data_blocks;
> +     info->data_start = (1 + meta_blocks) << INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +
> +     addr = kzalloc(INSITU_COMP_BLOCK_SIZE, GFP_KERNEL);
> +     if (!addr) {
> +             info->ti->error = "Cannot allocate super";
> +             return -ENOMEM;
> +     }
> +
> +     super = addr;
> +     ret = insitu_comp_access_super(info, addr, READ);
> +     if (ret)
> +             goto out;
> +
> +     if (le64_to_cpu(super->magic) == INSITU_COMP_SUPER_MAGIC) {
> +             if (le64_to_cpu(super->version) != INSITU_COMP_VERSION ||
> +                 le64_to_cpu(super->meta_blocks) != meta_blocks ||
> +                 le64_to_cpu(super->data_blocks) != data_blocks) {
> +                     info->ti->error = "Super is invalid";
> +                     ret = -EINVAL;
> +                     goto out;
> +             }
> +             if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
> +                     info->ti->error =
> +                                     "Compression algorithm not supported";
> +                     ret = -EINVAL;
> +                     goto out;
> +             }
> +     } else {
> +             super->magic = cpu_to_le64(INSITU_COMP_SUPER_MAGIC);
> +             super->version = cpu_to_le64(INSITU_COMP_VERSION);
> +             super->meta_blocks = cpu_to_le64(meta_blocks);
> +             super->data_blocks = cpu_to_le64(data_blocks);
> +             super->comp_alg = default_compressor;
> +             ret = insitu_comp_access_super(info, addr, WRITE_FUA);
> +             if (ret) {
> +                     info->ti->error = "Writing super failed";
> +                     goto out;
> +             }
> +             new_super = true;
> +     }
> +
> +     info->comp_alg = super->comp_alg;
> +     if (insitu_comp_alloc_compressor(info)) {
> +             ret = -ENOMEM;
> +             goto out;
> +     }
> +
> +     info->meta_bitmap_bits = data_blocks * INSITU_COMP_META_BITS;
> +     len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> +     len *= sizeof(unsigned long);
> +     info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +     info->meta_bitmap = vmalloc(info->meta_bitmap_pages * PAGE_SIZE);
> +     if (!info->meta_bitmap) {
> +             ret = -ENOMEM;
> +             goto bitmap_err;
> +     }
> +
> +     ret = insitu_comp_init_meta(info, new_super);
> +     if (ret)
> +             goto meta_err;
> +
> +     return 0;
> +meta_err:
> +     vfree(info->meta_bitmap);
> +bitmap_err:
> +     insitu_comp_free_compressor(info);
> +out:
> +     kfree(addr);
> +     return ret;
> +}
> +
> +/*
> + * <dev> <writethrough>/<writeback> <meta_commit_delay>
> + */
> +static int insitu_comp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +     struct insitu_comp_info *info;
> +     char write_mode[15];
> +     int ret, i;
> +
> +     if (argc < 2) {
> +             ti->error = "Invalid argument count";
> +             return -EINVAL;
> +     }
> +
> +     info = kzalloc(sizeof(*info), GFP_KERNEL);
> +     if (!info) {
> +             ti->error = "Cannot allocate context";
> +             return -ENOMEM;
> +     }
> +     info->ti = ti;
> +
> +     if (sscanf(argv[1], "%s", write_mode) != 1) {
> +             ti->error = "Invalid argument";
> +             ret = -EINVAL;
> +             goto err_para;
> +     }
> +
> +     if (strcmp(write_mode, "writeback") == 0) {
> +             if (argc != 3) {
> +                     ti->error = "Invalid argument";
> +                     ret = -EINVAL;
> +                     goto err_para;
> +             }
> +             info->write_mode = INSITU_COMP_WRITE_BACK;
> +             if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
> +                     ti->error = "Invalid argument";
> +                     ret = -EINVAL;
> +                     goto err_para;
> +             }
> +     } else if (strcmp(write_mode, "writethrough") == 0) {
> +             info->write_mode = INSITU_COMP_WRITE_THROUGH;
> +     } else {
> +             ti->error = "Invalid argument";
> +             ret = -EINVAL;
> +             goto err_para;
> +     }
> +
> +     if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +                                                     &info->dev)) {
> +             ti->error = "Can't get device";
> +             ret = -EINVAL;
> +             goto err_para;
> +     }
> +
> +     info->io_client = dm_io_client_create();
> +     if (IS_ERR(info->io_client)) {
> +             ti->error = "Can't create io client";
> +             ret = PTR_ERR(info->io_client);
> +             goto err_ioclient;
> +     }
> +
> +     if (bdev_logical_block_size(info->dev->bdev) != 512) {
> +             ti->error = "Logical block size must be 512";
> +             ret = -EINVAL;
> +             goto err_blocksize;
> +     }
> +
> +     ret = insitu_comp_read_or_create_super(info);
> +     if (ret)
> +             goto err_blocksize;
> +
> +     for (i = 0; i < BITMAP_HASH_LEN; i++) {
> +             info->bitmap_locks[i].io_running = 0;
> +             spin_lock_init(&info->bitmap_locks[i].wait_lock);
> +             INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
> +     }
> +
> +     atomic64_set(&info->compressed_write_size, 0);
> +     atomic64_set(&info->uncompressed_write_size, 0);
> +     atomic64_set(&info->meta_write_size, 0);
> +     ti->num_flush_bios = 1;
> +     /* doesn't support discard yet */
> +     ti->per_bio_data_size = sizeof(struct insitu_comp_req);
> +     ti->private = info;
> +     return 0;
> +err_blocksize:
> +     dm_io_client_destroy(info->io_client);
> +err_ioclient:
> +     dm_put_device(ti, info->dev);
> +err_para:
> +     kfree(info);
> +     return ret;
> +}
> +
> +static void insitu_comp_dtr(struct dm_target *ti)
> +{
> +     struct insitu_comp_info *info = ti->private;
> +
> +     if (info->write_mode == INSITU_COMP_WRITE_BACK)
> +             kthread_stop(info->writeback_tsk);
> +     insitu_comp_free_compressor(info);
> +     vfree(info->meta_bitmap);
> +     dm_io_client_destroy(info->io_client);
> +     dm_put_device(ti, info->dev);
> +     kfree(info);
> +}
> +
> +static u64 insitu_comp_sector_to_block(sector_t sect)
> +{
> +     return sect >> INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_block_hash_lock(struct insitu_comp_info *info, u64 block_index)
> +{
> +     return &info->bitmap_locks[(block_index >> HASH_LOCK_SHIFT) &
> +                     BITMAP_HASH_MASK];
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_trylock_block(struct insitu_comp_info *info,
> +     struct insitu_comp_req *req, u64 block_index)
> +{
> +     struct insitu_comp_hash_lock *hash_lock;
> +
> +     hash_lock = insitu_comp_block_hash_lock(req->info, block_index);
> +
> +     spin_lock_irq(&hash_lock->wait_lock);
> +     if (!hash_lock->io_running) {
> +             hash_lock->io_running = 1;
> +             spin_unlock_irq(&hash_lock->wait_lock);
> +             return hash_lock;
> +     }
> +     list_add_tail(&req->sibling, &hash_lock->wait_list);
> +     spin_unlock_irq(&hash_lock->wait_lock);
> +     return NULL;
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> +     struct list_head *list);
> +static void insitu_comp_unlock_block(struct insitu_comp_info *info,
> +     struct insitu_comp_req *req, struct insitu_comp_hash_lock *hash_lock)
> +{
> +     LIST_HEAD(pending_list);
> +     unsigned long flags;
> +
> +     spin_lock_irqsave(&hash_lock->wait_lock, flags);
> +     /* wakeup all pending reqs to avoid live lock */
> +     list_splice_init(&hash_lock->wait_list, &pending_list);
> +     hash_lock->io_running = 0;
> +     spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
> +
> +     insitu_comp_queue_req_list(info, &pending_list);
> +}
> +
> +static void insitu_comp_unlock_req_range(struct insitu_comp_req *req)
> +{
> +     insitu_comp_unlock_block(req->info, req, req->lock);
> +}
> +
> +/* See the comments at HASH_LOCK_SHIFT; each request needs to take only one lock */
> +static int insitu_comp_lock_req_range(struct insitu_comp_req *req)
> +{
> +     u64 block_index, tmp;
> +
> +     block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +     tmp = insitu_comp_sector_to_block(insitu_req_end_sector(req) - 1);
> +     BUG_ON(insitu_comp_block_hash_lock(req->info, block_index) !=
> +                     insitu_comp_block_hash_lock(req->info, tmp));
> +
> +     req->lock = insitu_comp_trylock_block(req->info, req, block_index);
> +     if (!req->lock)
> +             return 0;
> +
> +     return 1;
> +}
> +
> +static void insitu_comp_queue_req(struct insitu_comp_info *info,
> +     struct insitu_comp_req *req)
> +{
> +     unsigned long flags;
> +     struct insitu_comp_io_worker *worker =
> +             &insitu_comp_io_workers[req->cpu];
> +
> +     spin_lock_irqsave(&worker->lock, flags);
> +     list_add_tail(&req->sibling, &worker->pending);
> +     spin_unlock_irqrestore(&worker->lock, flags);
> +
> +     queue_work_on(req->cpu, insitu_comp_wq, &worker->work);
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> +     struct list_head *list)
> +{
> +     struct insitu_comp_req *req;
> +     while (!list_empty(list)) {
> +             req = list_first_entry(list, struct insitu_comp_req, sibling);
> +             list_del_init(&req->sibling);
> +             insitu_comp_queue_req(info, req);
> +     }
> +}
> +
> +static void insitu_comp_get_req(struct insitu_comp_req *req)
> +{
> +     atomic_inc(&req->io_pending);
> +}
> +
> +static void insitu_comp_free_io_range(struct insitu_comp_io_range *io)
> +{
> +     kfree(io->decomp_data);
> +     kfree(io->comp_data);
> +     kmem_cache_free(insitu_comp_io_range_cachep, io);
> +}
> +
> +static void insitu_comp_put_req(struct insitu_comp_req *req)
> +{
> +     struct insitu_comp_io_range *io;
> +
> +     if (atomic_dec_return(&req->io_pending))
> +             return;
> +
> +     if (req->stage == STAGE_INIT) /* waiting for locking */
> +             return;
> +
> +     if (req->stage == STAGE_READ_DECOMP ||
> +         req->stage == STAGE_WRITE_COMP ||
> +         req->result)
> +             req->stage = STAGE_DONE;
> +
> +     if (req->stage != STAGE_DONE) {
> +             insitu_comp_queue_req(req->info, req);
> +             return;
> +     }
> +
> +     while (!list_empty(&req->all_io)) {
> +             io = list_entry(req->all_io.next, struct insitu_comp_io_range,
> +                     next);
> +             list_del(&io->next);
> +             insitu_comp_free_io_range(io);
> +     }
> +
> +     insitu_comp_unlock_req_range(req);
> +
> +     insitu_req_endio(req, req->result);
> +}
> +
> +static void insitu_comp_io_range_done(unsigned long error, void *context)
> +{
> +     struct insitu_comp_io_range *io = context;
> +
> +     if (error)
> +             io->req->result = error;
> +     insitu_comp_put_req(io->req);
> +}
> +
> +static inline int insitu_comp_compressor_len(struct insitu_comp_info *info,
> +     int len)
> +{
> +     if (compressors[info->comp_alg].comp_len)
> +             return compressors[info->comp_alg].comp_len(len);
> +     return len;
> +}
> +
> +/*
> + * caller should set region.sector, region.count. bi_rw. IO always to/from
> + * comp_data
> + */
> +static struct insitu_comp_io_range *
> +insitu_comp_create_io_range(struct insitu_comp_req *req, int comp_len,
> +     int decomp_len)
> +{
> +     struct insitu_comp_io_range *io;
> +
> +     io = kmem_cache_alloc(insitu_comp_io_range_cachep, GFP_NOIO);
> +     if (!io)
> +             return NULL;
> +
> +     io->comp_data = kmalloc(insitu_comp_compressor_len(req->info, comp_len),
> +                                                             GFP_NOIO);
> +     io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
> +     if (!io->decomp_data || !io->comp_data) {
> +             kfree(io->decomp_data);
> +             kfree(io->comp_data);
> +             kmem_cache_free(insitu_comp_io_range_cachep, io);
> +             return NULL;
> +     }
> +
> +     io->io_req.notify.fn = insitu_comp_io_range_done;
> +     io->io_req.notify.context = io;
> +     io->io_req.client = req->info->io_client;
> +     io->io_req.mem.type = DM_IO_KMEM;
> +     io->io_req.mem.ptr.addr = io->comp_data;
> +     io->io_req.mem.offset = 0;
> +
> +     io->io_region.bdev = req->info->dev->bdev;
> +
> +     io->decomp_len = decomp_len;
> +     io->comp_len = comp_len;
> +     io->req = req;
> +     return io;
> +}
> +
> +static void insitu_comp_req_copy(struct insitu_comp_req *req, off_t req_off,
> +             void *buf, ssize_t len, bool to_buf)
> +{
> +     struct bio *bio = req->bio;
> +     struct bvec_iter iter;
> +     off_t buf_off = 0;
> +     ssize_t size;
> +     void *addr;
> +
> +     iter = bio->bi_iter;
> +     bio_advance_iter(bio, &iter, req_off);
> +
> +     while (len) {
> +             addr = kmap_atomic(bio_iter_page(bio, iter));
> +             size = min_t(ssize_t, len, bio_iter_len(bio, iter));
> +             if (to_buf)
> +                     memcpy(buf + buf_off, addr + bio_iter_offset(bio, iter),
> +                             size);
> +             else
> +                     memcpy(addr + bio_iter_offset(bio, iter), buf + buf_off,
> +                             size);
> +             kunmap_atomic(addr);
> +
> +             buf_off += size;
> +             len -= size;
> +
> +             bio_advance_iter(bio, &iter, size);
> +     }
> +}
> +
> +/*
> + * return value:
> + * < 0 : error
> + * == 0 : ok
> + * == 1 : ok, but comp/decomp is skipped
> + * Compressed data size is rounded up to 512B, which makes the payload.
> + * We store the actual compressed length in the last u32 of the payload.
> + * If there is no free space for it, we add 512B to the payload size.
> + */
> +static int insitu_comp_io_range_comp(struct insitu_comp_info *info,
> +     void *comp_data, unsigned int *comp_len, void *decomp_data,
> +     unsigned int decomp_len, bool do_comp)
> +{
> +     struct crypto_comp *tfm;
> +     u32 *addr;
> +     unsigned int actual_comp_len;
> +     int ret;
> +
> +     if (do_comp) {
> +             actual_comp_len = *comp_len;
> +
> +             tfm = info->tfm[get_cpu()];
> +             ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
> +                     comp_data, &actual_comp_len);
> +             put_cpu();
> +
> +             atomic64_add(decomp_len, &info->uncompressed_write_size);
> +             if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
> +                     *comp_len = decomp_len;
> +                     atomic64_add(*comp_len, &info->compressed_write_size);
> +                     return 1;
> +             }
> +
> +             *comp_len = round_up(actual_comp_len, 512);
> +             if (*comp_len - actual_comp_len < sizeof(u32))
> +                     *comp_len += 512;
> +             atomic64_add(*comp_len, &info->compressed_write_size);
> +             addr = comp_data + *comp_len;
> +             addr--;
> +             *addr = cpu_to_le32(actual_comp_len);
> +     } else {
> +             if (*comp_len == decomp_len)
> +                     return 1;
> +             addr = comp_data + *comp_len;
> +             addr--;
> +             actual_comp_len = le32_to_cpu(*addr);
> +
> +             tfm = info->tfm[get_cpu()];
> +             ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
> +                     decomp_data, &decomp_len);
> +             put_cpu();
> +             if (ret)
> +                     return -EINVAL;
> +     }
> +     return 0;
> +}
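For clarity, the payload sizing rule described in the comment above can be re-derived as a standalone sketch (`payload_len` is an illustrative name, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative re-derivation of the payload sizing used above: round the
 * compressed length up to a 512B sector, then add one more sector when the
 * rounding slack cannot hold the trailing u32 that records the exact
 * compressed length. Hypothetical helper, not from the patch. */
static unsigned int payload_len(unsigned int actual_comp_len)
{
	unsigned int len = (actual_comp_len + 511) & ~511u; /* round_up(x, 512) */

	if (len - actual_comp_len < sizeof(uint32_t)) /* no room for trailer */
		len += 512;
	return len;
}
```

With the 1500-byte example from the patch description, this gives a 1536-byte (3-sector) payload; a 1535-byte compressed result would spill to 4 sectors because only 1 byte of slack remains for the trailer.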
> +
> +/*
> + * compressed data has been read in. We decompress it and fill req. If there
> + * is no valid compressed data, we just zero req
> + */
> +static void insitu_comp_handle_read_decomp(struct insitu_comp_req *req)
> +{
> +     struct insitu_comp_io_range *io;
> +     off_t req_off = 0;
> +     int ret;
> +
> +     req->stage = STAGE_READ_DECOMP;
> +
> +     if (req->result)
> +             return;
> +
> +     list_for_each_entry(io, &req->all_io, next) {
> +             ssize_t dst_off = 0, src_off = 0, len;
> +
> +             io->io_region.sector -= req->info->data_start;
> +
> +             /* Do decomp here */
> +             ret = insitu_comp_io_range_comp(req->info, io->comp_data,
> +                     &io->comp_len, io->decomp_data, io->decomp_len, false);
> +             if (ret < 0) {
> +                     req->result = -EIO;
> +                     return;
> +             }
> +
> +             if (io->io_region.sector >= insitu_req_start_sector(req))
> +                     dst_off = (io->io_region.sector -
> +                             insitu_req_start_sector(req)) << 9;
> +             else
> +                     src_off = (insitu_req_start_sector(req) -
> +                             io->io_region.sector) << 9;
> +             len = min_t(ssize_t, io->decomp_len - src_off,
> +                     (insitu_req_sectors(req) << 9) - dst_off);
> +
> +             /* io range in all_io list is ordered for read IO */
> +             while (req_off != dst_off) {
> +                     ssize_t size = min_t(ssize_t, PAGE_SIZE,
> +                                     dst_off - req_off);
> +                     insitu_comp_req_copy(req, req_off,
> +                             empty_zero_page, size, false);
> +                     req_off += size;
> +             }
> +
> +             if (ret == 1) /* uncompressed, valid data is in .comp_data */
> +                     insitu_comp_req_copy(req, dst_off,
> +                                     io->comp_data + src_off, len, false);
> +             else
> +                     insitu_comp_req_copy(req, dst_off,
> +                                     io->decomp_data + src_off, len, false);
> +             req_off = dst_off + len;
> +     }
> +
> +     while (req_off != (insitu_req_sectors(req) << 9)) {
> +             ssize_t size = min_t(ssize_t, PAGE_SIZE,
> +                     (insitu_req_sectors(req) << 9) - req_off);
> +             insitu_comp_req_copy(req, req_off, empty_zero_page,
> +                     size, false);
> +             req_off += size;
> +     }
> +}
> +
> +/*
> + * read one extent's data from disk. The extent starts at block @block and
> + * has @data_sectors of data
> + */
> +static void insitu_comp_read_one_extent(struct insitu_comp_req *req,
> +     u64 block, u16 logical_sectors, u16 data_sectors)
> +{
> +     struct insitu_comp_io_range *io;
> +
> +     io = insitu_comp_create_io_range(req, data_sectors << 9,
> +             logical_sectors << 9);
> +     if (!io) {
> +             req->result = -EIO;
> +             return;
> +     }
> +
> +     insitu_comp_get_req(req);
> +     list_add_tail(&io->next, &req->all_io);
> +
> +     io->io_region.sector = (block << INSITU_COMP_BLOCK_SECTOR_SHIFT) +
> +                             req->info->data_start;
> +     io->io_region.count = data_sectors;
> +
> +     io->io_req.bi_rw = READ;
> +     dm_io(&io->io_req, 1, &io->io_region, NULL);
> +}
> +
> +static void insitu_comp_handle_read_read_existing(struct insitu_comp_req *req)
> +{
> +     u64 block_index, first_block_index;
> +     u16 logical_sectors, data_sectors;
> +
> +     req->stage = STAGE_READ_EXISTING;
> +
> +     block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +again:
> +     insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +             &logical_sectors, &data_sectors);
> +     if (data_sectors > 0)
> +             insitu_comp_read_one_extent(req, first_block_index,
> +                     logical_sectors, data_sectors);
> +
> +     if (req->result)
> +             return;
> +
> +     block_index = first_block_index + (logical_sectors >>
> +                             INSITU_COMP_BLOCK_SECTOR_SHIFT);
> +     /* the request might cover several extents */
> +     if ((block_index << INSITU_COMP_BLOCK_SECTOR_SHIFT) <
> +                     insitu_req_end_sector(req))
> +             goto again;
> +
> +     /* A shortcut if all data is in already */
> +     if (list_empty(&req->all_io))
> +             insitu_comp_handle_read_decomp(req);
> +}
> +
> +static void insitu_comp_handle_read_request(struct insitu_comp_req *req)
> +{
> +     insitu_comp_get_req(req);
> +
> +     if (req->stage == STAGE_INIT) {
> +             if (!insitu_comp_lock_req_range(req)) {
> +                     insitu_comp_put_req(req);
> +                     return;
> +             }
> +
> +             insitu_comp_handle_read_read_existing(req);
> +     } else if (req->stage == STAGE_READ_EXISTING)
> +             insitu_comp_handle_read_decomp(req);
> +
> +     insitu_comp_put_req(req);
> +}
> +
> +static void insitu_comp_write_meta_done(void *context, unsigned long error)
> +{
> +     struct insitu_comp_req *req = context;
> +     insitu_comp_put_req(req);
> +}
> +
> +static u64 insitu_comp_block_meta_page_index(u64 block, bool end)
> +{
> +     u64 bits = block * INSITU_COMP_META_BITS - !!end;
> +     /* (1 << 3) bits per byte */
> +     return bits >> (3 + PAGE_SHIFT);
> +}
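The bit arithmetic above can be sketched on its own (assuming 5 meta bits per block and a PAGE_SHIFT of 12, i.e. 4k pages; the helper name is illustrative):

```c
#include <assert.h>

/* Sketch of the meta-bitmap page index math: each data block owns 5 meta
 * bits, so block N's meta starts at bit N * 5; with 8 bits per byte and an
 * assumed 4k page, the page holding a bit is bit >> (3 + 12). For the
 * inclusive end of a range we step back to the last bit actually used. */
static unsigned long long meta_page_index(unsigned long long block, int end)
{
	unsigned long long bit = block * 5 - (end ? 1 : 0);

	return bit >> (3 + 12);
}
```

One 4k page holds 32768 bits of meta; since 32768 is not a multiple of 5, a block's 5 bits can straddle two pages, which is why the start and end of a dirty range are computed separately.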
> +
> +/*
> + * the request covers some extents partially. Decompress data of the extents,
> + * compress remaining valid data, and finally write them out
> + */
> +static int insitu_comp_handle_write_modify(struct insitu_comp_io_range *io,
> +     u64 *meta_start, u64 *meta_end, bool *handle_req)
> +{
> +     struct insitu_comp_req *req = io->req;
> +     sector_t start, count;
> +     unsigned int comp_len;
> +     off_t offset;
> +     u64 page_index;
> +     int ret;
> +
> +     io->io_region.sector -= req->info->data_start;
> +
> +     /* decompress original data */
> +     ret = insitu_comp_io_range_comp(req->info, io->comp_data, &io->comp_len,
> +                     io->decomp_data, io->decomp_len, false);
> +     if (ret < 0) {
> +             req->result = -EINVAL;
> +             return -EIO;
> +     }
> +
> +     start = io->io_region.sector;
> +     count = io->decomp_len >> 9;
> +     if (start < insitu_req_start_sector(req) && start + count >
> +                                     insitu_req_end_sector(req)) {
> +             /* we don't split an extent */
> +             if (ret == 1) {
> +                     memcpy(io->decomp_data, io->comp_data, io->decomp_len);
> +                     insitu_comp_req_copy(req, 0,
> +                        io->decomp_data + ((insitu_req_start_sector(req) -
> +                        start) << 9), insitu_req_sectors(req) << 9, true);
> +             } else {
> +                     insitu_comp_req_copy(req, 0,
> +                        io->decomp_data + ((insitu_req_start_sector(req) -
> +                        start) << 9), insitu_req_sectors(req) << 9, true);
> +                     kfree(io->comp_data);
> +                     /* New compressed len might be bigger */
> +                     io->comp_data = kmalloc(insitu_comp_compressor_len(
> +                             req->info, io->decomp_len), GFP_NOIO);
> +                     io->comp_len = io->decomp_len;
> +                     if (!io->comp_data) {
> +                             req->result = -ENOMEM;
> +                             return -EIO;
> +                     }
> +                     io->io_req.mem.ptr.addr = io->comp_data;
> +             }
> +             /* need compress data */
> +             ret = 0;
> +             offset = 0;
> +             *handle_req = false;
> +     } else if (start < insitu_req_start_sector(req)) {
> +             count = insitu_req_start_sector(req) - start;
> +             offset = 0;
> +     } else {
> +             offset = insitu_req_end_sector(req) - start;
> +             start = insitu_req_end_sector(req);
> +             count = count - offset;
> +     }
> +
> +     /* Original data is uncompressed, we don't need writeback */
> +     if (ret == 1) {
> +             comp_len = count << 9;
> +             goto handle_meta;
> +     }
> +
> +     /* assume compressing less data uses less space (at least 4k less data) */
> +     comp_len = io->comp_len;
> +     ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> +             io->decomp_data + (offset << 9), count << 9, true);
> +     if (ret < 0) {
> +             req->result = -EIO;
> +             return -EIO;
> +     }
> +
> +     insitu_comp_get_req(req);
> +     if (ret == 1)
> +             io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
> +     io->io_region.count = comp_len >> 9;
> +     io->io_region.sector = start + req->info->data_start;
> +
> +     io->io_req.bi_rw = insitu_req_rw(req);
> +     dm_io(&io->io_req, 1, &io->io_region, NULL);
> +handle_meta:
> +     insitu_comp_set_extent(req, start >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +             count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> +     page_index = insitu_comp_block_meta_page_index(start >>
> +                                     INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
> +     if (*meta_start > page_index)
> +             *meta_start = page_index;
> +     page_index = insitu_comp_block_meta_page_index(
> +             (start + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, true);
> +     if (*meta_end < page_index)
> +             *meta_end = page_index;
> +     return 0;
> +}
> +
> +/* Compress data and write it out */
> +static void insitu_comp_handle_write_comp(struct insitu_comp_req *req)
> +{
> +     struct insitu_comp_io_range *io;
> +     sector_t count;
> +     unsigned int comp_len;
> +     u64 meta_start = -1L, meta_end = 0, page_index;
> +     int ret;
> +     bool handle_req = true;
> +
> +     req->stage = STAGE_WRITE_COMP;
> +
> +     if (req->result)
> +             return;
> +
> +     list_for_each_entry(io, &req->all_io, next) {
> +             if (insitu_comp_handle_write_modify(io, &meta_start, &meta_end,
> +                                             &handle_req))
> +                     return;
> +     }
> +
> +     if (!handle_req)
> +             goto update_meta;
> +
> +     count = insitu_req_sectors(req);
> +     io = insitu_comp_create_io_range(req, count << 9, count << 9);
> +     if (!io) {
> +             req->result = -EIO;
> +             return;
> +     }
> +     insitu_comp_req_copy(req, 0, io->decomp_data, count << 9, true);
> +
> +     /* compress data */
> +     comp_len = io->comp_len;
> +     ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> +             io->decomp_data, count << 9, true);
> +     if (ret < 0) {
> +             insitu_comp_free_io_range(io);
> +             req->result = -EIO;
> +             return;
> +     }
> +
> +     insitu_comp_get_req(req);
> +     list_add_tail(&io->next, &req->all_io);
> +     io->io_region.sector = insitu_req_start_sector(req) +
> +             req->info->data_start;
> +     if (ret == 1)
> +             io->io_req.mem.ptr.addr = io->decomp_data;
> +     io->io_region.count = comp_len >> 9;
> +     io->io_req.bi_rw = insitu_req_rw(req);
> +     dm_io(&io->io_req, 1, &io->io_region, NULL);
> +     insitu_comp_set_extent(req,
> +             insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +             count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> +     page_index = insitu_comp_block_meta_page_index(
> +             insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> +             false);
> +     if (meta_start > page_index)
> +             meta_start = page_index;
> +     page_index = insitu_comp_block_meta_page_index(
> +             (insitu_req_start_sector(req) + count) >>
> +             INSITU_COMP_BLOCK_SECTOR_SHIFT, true);
> +     if (meta_end < page_index)
> +             meta_end = page_index;
> +update_meta:
> +     if (req->info->write_mode == INSITU_COMP_WRITE_THROUGH ||
> +                                     (insitu_req_rw(req) & REQ_FUA)) {
> +             insitu_comp_get_req(req);
> +             insitu_comp_write_meta(req->info, meta_start, meta_end + 1, req,
> +                     insitu_comp_write_meta_done, insitu_req_rw(req));
> +     }
> +}
> +
> +/* request might cover some extents partially, read them first */
> +static void insitu_comp_handle_write_read_existing(struct insitu_comp_req *req)
> +{
> +     u64 block_index, first_block_index;
> +     u16 logical_sectors, data_sectors;
> +
> +     req->stage = STAGE_READ_EXISTING;
> +
> +     block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +     insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +             &logical_sectors, &data_sectors);
> +     if (data_sectors > 0 && (first_block_index < block_index ||
> +         first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> +         insitu_comp_sector_to_block(insitu_req_end_sector(req))))
> +             insitu_comp_read_one_extent(req, first_block_index,
> +                     logical_sectors, data_sectors);
> +
> +     if (req->result)
> +             return;
> +
> +     if (first_block_index + insitu_comp_sector_to_block(logical_sectors) >=
> +         insitu_comp_sector_to_block(insitu_req_end_sector(req)))
> +             goto out;
> +
> +     block_index = insitu_comp_sector_to_block(insitu_req_end_sector(req)) - 1;
> +     insitu_comp_get_extent(req->info, block_index, &first_block_index,
> +             &logical_sectors, &data_sectors);
> +     if (data_sectors > 0 &&
> +         first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> +         block_index + 1)
> +             insitu_comp_read_one_extent(req, first_block_index,
> +                     logical_sectors, data_sectors);
> +
> +     if (req->result)
> +             return;
> +out:
> +     if (list_empty(&req->all_io))
> +             insitu_comp_handle_write_comp(req);
> +}
> +
> +static void insitu_comp_handle_write_request(struct insitu_comp_req *req)
> +{
> +     insitu_comp_get_req(req);
> +
> +     if (req->stage == STAGE_INIT) {
> +             if (!insitu_comp_lock_req_range(req)) {
> +                     insitu_comp_put_req(req);
> +                     return;
> +             }
> +
> +             insitu_comp_handle_write_read_existing(req);
> +     } else if (req->stage == STAGE_READ_EXISTING)
> +             insitu_comp_handle_write_comp(req);
> +
> +     insitu_comp_put_req(req);
> +}
> +
> +/* For writeback mode */
> +static void insitu_comp_handle_flush_request(struct insitu_comp_req *req)
> +{
> +     struct writeback_flush_data wb;
> +
> +     atomic_set(&wb.cnt, 1);
> +     init_completion(&wb.complete);
> +
> +     insitu_comp_flush_dirty_meta(req->info, &wb);
> +
> +     writeback_flush_io_done(&wb, 0);
> +     wait_for_completion(&wb.complete);
> +
> +     insitu_req_endio(req, 0);
> +}
> +
> +static void insitu_comp_handle_request(struct insitu_comp_req *req)
> +{
> +     if (insitu_req_rw(req) & REQ_FLUSH)
> +             insitu_comp_handle_flush_request(req);
> +     else if (insitu_req_rw(req) & REQ_WRITE)
> +             insitu_comp_handle_write_request(req);
> +     else
> +             insitu_comp_handle_read_request(req);
> +}
> +
> +static void insitu_comp_do_request_work(struct work_struct *work)
> +{
> +     struct insitu_comp_io_worker *worker = container_of(work,
> +                     struct insitu_comp_io_worker, work);
> +     LIST_HEAD(list);
> +     struct insitu_comp_req *req;
> +     struct blk_plug plug;
> +     bool repeat;
> +
> +     blk_start_plug(&plug);
> +again:
> +     spin_lock_irq(&worker->lock);
> +     list_splice_init(&worker->pending, &list);
> +     spin_unlock_irq(&worker->lock);
> +
> +     repeat = !list_empty(&list);
> +     while (!list_empty(&list)) {
> +             req = list_first_entry(&list, struct insitu_comp_req, sibling);
> +             list_del(&req->sibling);
> +
> +             insitu_comp_handle_request(req);
> +     }
> +     if (repeat)
> +             goto again;
> +     blk_finish_plug(&plug);
> +}
> +
> +static int insitu_comp_map(struct dm_target *ti, struct bio *bio)
> +{
> +     struct insitu_comp_info *info = ti->private;
> +     struct insitu_comp_req *req;
> +
> +     req = dm_per_bio_data(bio, sizeof(struct insitu_comp_req));
> +
> +     if ((bio->bi_rw & REQ_FLUSH) &&
> +                     info->write_mode == INSITU_COMP_WRITE_THROUGH) {
> +             bio->bi_bdev = info->dev->bdev;
> +             return DM_MAPIO_REMAPPED;
> +     }
> +
> +     req->bio = bio;
> +     req->info = info;
> +     atomic_set(&req->io_pending, 0);
> +     INIT_LIST_HEAD(&req->all_io);
> +     req->result = 0;
> +     req->stage = STAGE_INIT;
> +
> +     req->cpu = raw_smp_processor_id();
> +     insitu_comp_queue_req(info, req);
> +
> +     return DM_MAPIO_SUBMITTED;
> +}
> +
> +/*
> + * INFO: uncompressed_data_size compressed_data_size metadata_size
> + * TABLE: writethrough/writeback commit_delay
> + */
> +static void insitu_comp_status(struct dm_target *ti, status_type_t type,
> +                       unsigned status_flags, char *result, unsigned maxlen)
> +{
> +     struct insitu_comp_info *info = ti->private;
> +     unsigned int sz = 0;
> +
> +     switch (type) {
> +     case STATUSTYPE_INFO:
> +             DMEMIT("%llu %llu %llu",
> +                     (u64)atomic64_read(&info->uncompressed_write_size),
> +                     (u64)atomic64_read(&info->compressed_write_size),
> +                     (u64)atomic64_read(&info->meta_write_size));
> +             break;
> +     case STATUSTYPE_TABLE:
> +             if (info->write_mode == INSITU_COMP_WRITE_BACK)
> +                     DMEMIT("%s %s %d", info->dev->name, "writeback",
> +                             info->writeback_delay);
> +             else
> +                     DMEMIT("%s %s", info->dev->name, "writethrough");
> +             break;
> +     }
> +}
> +
> +static int insitu_comp_iterate_devices(struct dm_target *ti,
> +                               iterate_devices_callout_fn fn, void *data)
> +{
> +     struct insitu_comp_info *info = ti->private;
> +
> +     return fn(ti, info->dev, info->data_start,
> +             info->data_blocks << INSITU_COMP_BLOCK_SECTOR_SHIFT, data);
> +}
> +
> +static void insitu_comp_io_hints(struct dm_target *ti,
> +                         struct queue_limits *limits)
> +{
> +     /* No blk_limits_logical_block_size */
> +     limits->logical_block_size = limits->physical_block_size =
> +             limits->io_min = INSITU_COMP_BLOCK_SIZE;
> +     blk_limits_max_hw_sectors(limits, INSITU_COMP_MAX_SIZE >> 9);
> +}
> +
> +static int insitu_comp_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> +                     struct bio_vec *biovec, int max_size)
> +{
> +     /* Guarantee request can only cover one aligned 128k range */
> +     return min_t(int, max_size, INSITU_COMP_MAX_SIZE - bvm->bi_size -
> +                     ((bvm->bi_sector << 9) % INSITU_COMP_MAX_SIZE));
> +}
> +
> +static struct target_type insitu_comp_target = {
> +     .name   = "insitu_comp",
> +     .version = {1, 0, 0},
> +     .module = THIS_MODULE,
> +     .ctr    = insitu_comp_ctr,
> +     .dtr    = insitu_comp_dtr,
> +     .map    = insitu_comp_map,
> +     .status = insitu_comp_status,
> +     .iterate_devices = insitu_comp_iterate_devices,
> +     .io_hints = insitu_comp_io_hints,
> +     .merge = insitu_comp_merge,
> +};
> +
> +static int __init insitu_comp_init(void)
> +{
> +     int r;
> +
> +     for (r = 0; r < ARRAY_SIZE(compressors); r++)
> +             if (crypto_has_comp(compressors[r].name, 0, 0))
> +                     break;
> +     if (r >= ARRAY_SIZE(compressors)) {
> +             DMWARN("No crypto compressors are supported");
> +             return -EINVAL;
> +     }
> +
> +     default_compressor = r;
> +
> +     r = -ENOMEM;
> +     insitu_comp_io_range_cachep = kmem_cache_create("insitu_comp_io_range",
> +             sizeof(struct insitu_comp_io_range), 0, 0, NULL);
> +     if (!insitu_comp_io_range_cachep) {
> +             DMWARN("Can't create io_range cache");
> +             goto err;
> +     }
> +
> +     insitu_comp_meta_io_cachep = kmem_cache_create("insitu_comp_meta_io",
> +             sizeof(struct insitu_comp_meta_io), 0, 0, NULL);
> +     if (!insitu_comp_meta_io_cachep) {
> +             DMWARN("Can't create meta_io cache");
> +             goto err;
> +     }
> +
> +     insitu_comp_wq = alloc_workqueue("insitu_comp_io",
> +             WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> +     if (!insitu_comp_wq) {
> +             DMWARN("Can't create io workqueue");
> +             goto err;
> +     }
> +
> +     r = dm_register_target(&insitu_comp_target);
> +     if (r < 0) {
> +             DMWARN("target registration failed");
> +             goto err;
> +     }
> +
> +     for_each_possible_cpu(r) {
> +             INIT_LIST_HEAD(&insitu_comp_io_workers[r].pending);
> +             spin_lock_init(&insitu_comp_io_workers[r].lock);
> +             INIT_WORK(&insitu_comp_io_workers[r].work,
> +                     insitu_comp_do_request_work);
> +     }
> +     return 0;
> +err:
> +     if (insitu_comp_io_range_cachep)
> +             kmem_cache_destroy(insitu_comp_io_range_cachep);
> +     if (insitu_comp_meta_io_cachep)
> +             kmem_cache_destroy(insitu_comp_meta_io_cachep);
> +     if (insitu_comp_wq)
> +             destroy_workqueue(insitu_comp_wq);
> +
> +     return r;
> +}
> +
> +static void __exit insitu_comp_exit(void)
> +{
> +     dm_unregister_target(&insitu_comp_target);
> +     kmem_cache_destroy(insitu_comp_io_range_cachep);
> +     kmem_cache_destroy(insitu_comp_meta_io_cachep);
> +     destroy_workqueue(insitu_comp_wq);
> +}
> +
> +module_init(insitu_comp_init);
> +module_exit(insitu_comp_exit);
> +
> +MODULE_AUTHOR("Shaohua Li <s...@kernel.org>");
> +MODULE_DESCRIPTION(DM_NAME " target with insitu data compression for SSD");
> +MODULE_LICENSE("GPL");
> Index: linux/drivers/md/dm-insitu-comp.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.h 2014-02-17 18:37:07.108425465 +0800
> @@ -0,0 +1,158 @@
> +#ifndef __DM_INSITU_COMPRESSION_H__
> +#define __DM_INSITU_COMPRESSION_H__
> +#include <linux/types.h>
> +
> +struct insitu_comp_super_block {
> +     __le64 magic;
> +     __le64 version;
> +     __le64 meta_blocks;
> +     __le64 data_blocks;
> +     u8 comp_alg;
> +} __attribute__((packed));
> +
> +#define INSITU_COMP_SUPER_MAGIC 0x106526c206506c09
> +#define INSITU_COMP_VERSION 1
> +#define INSITU_COMP_ALG_LZO 0
> +#define INSITU_COMP_ALG_ZLIB 1
> +
> +#ifdef __KERNEL__
> +struct insitu_comp_compressor_data {
> +     char *name;
> +     int (*comp_len)(int comp_len);
> +};
> +
> +static inline int lzo_comp_len(int comp_len)
> +{
> +     return lzo1x_worst_compress(comp_len);
> +}
> +
> +/*
> + * Minimum logical sector size of this target is 4096 bytes, which is a block.
> + * Data of a block is compressed. Compressed data is rounded up to 512B, which
> + * is the payload. For each block, we have 5 bits of meta data. Bits 0 - 3
> + * store the payload length, 0 - 8 sectors. If the compressed payload length
> + * is 8 sectors, we just store uncompressed data. The actual compressed data
> + * length is stored in the last 32 bits of the payload if data is compressed.
> + * On disk, the payload is stored at the beginning of the logical sector of
> + * the block. If IO size is bigger than one block, we store the whole data as
> + * an extent. Bit 4 marks a tail block of an extent. Max allowed extent size
> + * is 128k.
> + */
> +#define INSITU_COMP_BLOCK_SIZE 4096
> +#define INSITU_COMP_BLOCK_SHIFT 12
> +#define INSITU_COMP_BLOCK_SECTOR_SHIFT (INSITU_COMP_BLOCK_SHIFT - 9)
> +
> +#define INSITU_COMP_MIN_SIZE 4096
> +/* Changing this should change HASH_LOCK_SHIFT too */
> +#define INSITU_COMP_MAX_SIZE (128 * 1024)
> +
> +#define INSITU_COMP_LENGTH_MASK ((1 << 4) - 1)
> +#define INSITU_COMP_TAIL_MASK (1 << 4)
> +#define INSITU_COMP_META_BITS 5
> +
> +#define INSITU_COMP_META_START_SECTOR (INSITU_COMP_BLOCK_SIZE >> 9)
> +
> +enum INSITU_COMP_WRITE_MODE {
> +     INSITU_COMP_WRITE_BACK,
> +     INSITU_COMP_WRITE_THROUGH,
> +};
> +
> +/*
> + * A request can only cover one aligned 128k (4k * (1 << 5)) range. Since the
> + * maximum request size is 128k, we only need to take one lock per request
> + */
> +#define HASH_LOCK_SHIFT 5
> +
> +#define BITMAP_HASH_SHIFT 9
> +#define BITMAP_HASH_MASK ((1 << BITMAP_HASH_SHIFT) - 1)
> +#define BITMAP_HASH_LEN (1 << BITMAP_HASH_SHIFT)
> +
> +struct insitu_comp_hash_lock {
> +     int io_running;
> +     spinlock_t wait_lock;
> +     struct list_head wait_list;
> +};
> +
> +struct insitu_comp_info {
> +     struct dm_target *ti;
> +     struct dm_dev *dev;
> +
> +     int comp_alg;
> +     struct crypto_comp *tfm[NR_CPUS];
> +
> +     sector_t data_start;
> +     u64 data_blocks;
> +
> +     char *meta_bitmap;
> +     u64 meta_bitmap_bits;
> +     u64 meta_bitmap_pages;
> +     struct insitu_comp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
> +
> +     enum INSITU_COMP_WRITE_MODE write_mode;
> +     unsigned int writeback_delay; /* in seconds */
> +     struct task_struct *writeback_tsk;
> +     struct dm_io_client *io_client;
> +
> +     atomic64_t compressed_write_size;
> +     atomic64_t uncompressed_write_size;
> +     atomic64_t meta_write_size;
> +};
> +
> +struct insitu_comp_meta_io {
> +     struct dm_io_request io_req;
> +     struct dm_io_region io_region;
> +     void *data;
> +     void (*fn)(void *data, unsigned long error);
> +};
> +
> +struct insitu_comp_io_range {
> +     struct dm_io_request io_req;
> +     struct dm_io_region io_region;
> +     void *decomp_data;
> +     unsigned int decomp_len;
> +     void *comp_data;
> +     unsigned int comp_len; /* For write, this is estimated */
> +     struct list_head next;
> +     struct insitu_comp_req *req;
> +};
> +
> +enum INSITU_COMP_REQ_STAGE {
> +     STAGE_INIT,
> +     STAGE_READ_EXISTING,
> +     STAGE_READ_DECOMP,
> +     STAGE_WRITE_COMP,
> +     STAGE_DONE,
> +};
> +
> +struct insitu_comp_req {
> +     struct bio *bio;
> +     struct insitu_comp_info *info;
> +     struct list_head sibling;
> +
> +     struct list_head all_io;
> +     atomic_t io_pending;
> +     enum INSITU_COMP_REQ_STAGE stage;
> +
> +     struct insitu_comp_hash_lock *lock;
> +     int result;
> +
> +     int cpu;
> +};
> +
> +#define insitu_req_start_sector(req) (req->bio->bi_iter.bi_sector)
> +#define insitu_req_end_sector(req) (bio_end_sector(req->bio))
> +#define insitu_req_rw(req) (req->bio->bi_rw)
> +#define insitu_req_sectors(req) (bio_sectors(req->bio))
> +
> +static inline void insitu_req_endio(struct insitu_comp_req *req, int error)
> +{
> +     bio_endio(req->bio, error);
> +}
> +
> +struct insitu_comp_io_worker {
> +     struct list_head pending;
> +     spinlock_t lock;
> +     struct work_struct work;
> +};
> +#endif
> +
> +#endif
> Index: linux/Documentation/device-mapper/insitu-comp.txt
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/Documentation/device-mapper/insitu-comp.txt 2014-02-17 17:34:45.427464765 +0800
> @@ -0,0 +1,50 @@
> +This is a simple DM target supporting compression for SSDs only. The
> +underlying SSD must support a 512B sector size; the target only supports a 4k
> +sector size.
> +
> +Disk layout:
> +|super|...meta...|..data...|
> +
> +The store unit is 4k (a block). The super block is 1 block, which stores the
> +meta and data sizes and the compression algorithm. Meta is a bitmap. For each
> +data block, there are 5 bits of meta.
> +
> +Data:
> +The data of a block is compressed, and the compressed data is rounded up to
> +a 512B boundary; the rounded-up result is the payload. On disk, the payload
> +is stored at the beginning of the block's logical sectors. For example, say
> +we store data to block A, which starts at sector B (A*8); its original size
> +is 4k and its compressed size is 1500 bytes. The compressed data (CD) will
> +use 3 sectors (of 512B each). These 3 sectors are the payload, which is
> +stored starting at sector B.
> +
> +---------------------------------------------------
> +... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
> +---------------------------------------------------
> +    ^B    ^B+1  ^B+2                  ^B+7 ^B+8
> +
> +For this block, sectors B+3 to B+7 are left unused (a hole). We use 4 meta
> +bits to record the payload size in sectors. The compressed size (1500) isn't
> +stored in the meta directly; instead, it is stored in the last 32 bits of
> +the payload, in this example at the end of sector B+2. If the compressed
> +size plus those 32 bits crosses a sector boundary, the payload grows by one
> +sector. If the payload would use all 8 sectors, we store the uncompressed
> +data directly.
> +
> +If the IO size is bigger than one block, we can store the data as an
> +extent. The data of the whole extent is compressed and stored in the same
> +way as above. The first block of the extent is the head; all the others are
> +tail blocks (a 1-block extent is just a head). We have 1 bit of meta to
> +record whether a block is a head or a tail. If the 4 meta bits of the head
> +block can't hold the extent payload size, we borrow tail-block meta bits to
> +store it. The max allowed extent size is 128k, so we never compress or
> +decompress too large a chunk at once.
> +
> +Meta:
> +Modifying data modifies the meta too. Meta is written (flushed) to disk
> +according to the meta write policy; we support writeback and writethrough
> +modes. In writeback mode, meta is written to disk at an interval or on a
> +FLUSH request. In writethrough mode, data and metadata are written to disk
> +together.
> +
> +=========================
> +Parameters: <dev> [<writethrough>|<writeback> <meta_commit_delay>]
> +   <dev>: underlying device
> +   <writethrough>: metadata is flushed to disk in writethrough mode
> +   <writeback>: metadata is flushed to disk in writeback mode
> +   <meta_commit_delay>: metadata flush interval (writeback mode only)
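For reference, a table line matching that parameter list would look something like the below. The target name "insitu_comp" and the device path are my guesses from the patch, so treat this as illustrative rather than tested:

```shell
# Create a compressed device over /dev/sdb in writeback mode with a
# 5-second meta commit delay (target name and device are assumptions).
SECTORS=$(blockdev --getsz /dev/sdb)
dmsetup create comp --table "0 $SECTORS insitu_comp /dev/sdb writeback 5"

# Writethrough mode presumably takes no delay parameter:
# dmsetup create comp --table "0 $SECTORS insitu_comp /dev/sdb writethrough"
```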