On Mon, Apr 22, 2024 at 06:05:40PM +0800, Yang Yang wrote:
> __send_empty_flush() sends empty flush bios to every target in the
> dm_table. However, if the num_targets exceeds the number of block
> devices in the dm_table's device list, it could lead to multiple
> invocations of __send_duplicate_bios() for the same block device.
> Typically, a single thread sending numerous empty flush bios to one
> block device is redundant, as these bios are likely to be merged by the
> flush state machine. In scenarios where num_targets significantly
> outweighs the number of block devices, such behavior may result in a
> noteworthy decrease in performance.
> 
> This issue can be reproduced using this command line:
>   for i in {0..1023}; do
>     echo $((8000*$i)) 8000 linear /dev/sda2 $((16384*$i))
>   done | dmsetup create example

Only _one_ dm_dev will be created for this example: /dev/sda2
So your bitmap is looking at a single bit.

> With this fix, a random write with fsync workload executed with the
> following fio command:
> 
>   fio --group_reporting --name=benchmark --filename=/dev/mapper/example \
>       --ioengine=sync --invalidate=1 --numjobs=16 --rw=randwrite \
>       --blocksize=4k --size=2G --time_based --runtime=30 --fdatasync=1
> 
> results in an increase from 857 KB/s to 30.8 MB/s of the write
> throughput (3580% increase).
> 
> Signed-off-by: Yang Yang <[email protected]>

I'm including my original fine-grained review comments inlined below
BUT, having wasted time reviewing this patch I'm left with the
conclusions:

1) this patch has serious issues.
2) it fixes an issue with a toy 'example' but ignores that not all
   targets are linear, each disparate target could have their own
   reason(s) for actually _needing_ the redundant flushes.

I'm inclined to never take this type of change.

> ---
>  drivers/md/dm-core.h          |  1 +
>  drivers/md/dm-table.c         |  7 +++++
>  drivers/md/dm.c               | 59 +++++++++++++++++++++++++++++++++++
>  include/linux/device-mapper.h |  6 ++++
>  4 files changed, 73 insertions(+)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index e6757a30dcca..7e3f2168289f 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -217,6 +217,7 @@ struct dm_table {
>       /* a list of devices used by this table */
>       struct list_head devices;
>       struct rw_semaphore devices_lock;
> +     unsigned short num_devices;
>  
>       /* events get handed up using this callback */
>       void (*event_fn)(void *data);
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 41f1d731ae5a..ddc60e498afb 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -2133,6 +2133,8 @@ void dm_table_postsuspend_targets(struct dm_table *t)
>  
>  int dm_table_resume_targets(struct dm_table *t)
>  {
> +     struct list_head *devices = dm_table_get_devices(t);
> +     struct dm_dev_internal *dd;
>       unsigned int i;
>       int r = 0;
>  
> @@ -2159,6 +2161,11 @@ int dm_table_resume_targets(struct dm_table *t)
>                       ti->type->resume(ti);
>       }
>  
> +     t->num_devices = 0;
> +
> +     list_for_each_entry(dd, devices, list)
> +             dd->dm_dev->index = ++(t->num_devices);
> +
>       return 0;
>  }
>  
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 56aa2a8b9d71..7297235291f6 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -48,6 +48,8 @@
>   */
>  #define REQ_DM_POLL_LIST     REQ_DRV
>  
> +#define DM_MAX_TABLE_DEVICES 1024
> +

This name is too general. This limit isn't imposed for anything other
than bounding the size of the bitmap used to avoid redundant flushes.

So maybe rename to: DM_MAX_REDUNDANT_FLUSH_BITMAP_DEVICES

>  static const char *_name = DM_NAME;
>  
>  static unsigned int major;
> @@ -1543,10 +1545,38 @@ static unsigned int __send_duplicate_bios(struct 
> clone_info *ci, struct dm_targe
>       return ret;
>  }
>  
> +static inline bool has_redundant_flush(struct dm_table *t,
> +             unsigned long **bitmap)
> +{
> +     if (t->num_devices < t->num_targets) {
> +             /* Add a limit here to prevent excessive memory usage for 
> bitmaps */
> +             if (t->num_devices >= DM_MAX_TABLE_DEVICES)
> +                     return false;

OK, in practice I'm not aware of tables that require such an excessive
amount of devices.

> +             /* dm_dev's index starts from 1, so need plus 1 here */
> +             *bitmap = bitmap_zalloc(t->num_devices + 1, GFP_KERNEL);
> +             if (*bitmap)
> +                     return true;
> +     }

This method is being used in the IO path, so you cannot use
GFP_KERNEL.  Please switch to GFP_NOIO.

> +
> +     return false;
> +}
> +
> +static int dm_get_dev_index(struct dm_target *ti, struct dm_dev *dev,
> +                                  sector_t start, sector_t len, void *data)
> +{
> +     unsigned short *index = data;
> +     *index = dev->index;
> +     return 0;
> +}
> +
>  static void __send_empty_flush(struct clone_info *ci)
>  {
>       struct dm_table *t = ci->map;
>       struct bio flush_bio;
> +     unsigned long *handled_map;

Please rename, e.g.: handled_devices_bitmap

> +     unsigned int nr_handled = 0;
> +     bool check = has_redundant_flush(t, &handled_map);
>  
>       /*
>        * Use an on-stack bio for this, it's safe since we don't
> @@ -1562,17 +1592,46 @@ static void __send_empty_flush(struct clone_info *ci)
>  
>       for (unsigned int i = 0; i < t->num_targets; i++) {
>               unsigned int bios;
> +             unsigned short index = 0;
>               struct dm_target *ti = dm_table_get_target(t, i);
>  
>               if (unlikely(ti->num_flush_bios == 0))
>                       continue;
>  
> +             /*
> +              * If the num_targets is greater than the number of block 
> devices
> +              * in the dm_table's devices list, __send_empty_flush() might
> +              * invoke __send_duplicate_bios() multiple times for the same
> +              * block device. This could lead to a substantial decrease in
> +              * performance when num_targets significantly exceeds the number
> +              * of block devices.
> +              * Ensure that __send_duplicate_bios() is only called once for
> +              * each block device.
> +              */
> +             if (check) {
> +                     if (nr_handled == t->num_devices)
> +                             break;
> +
> +                     if (ti->type->iterate_devices)
> +                             ti->type->iterate_devices(ti, dm_get_dev_index, 
> &index);

You're looping through all data devices in a target, so you're only
getting the last device in the target's index.  That seems very
broken.

But each target in your test 'example' device (mentioned in the patch
header) only has a single device so you wouldn't have noticed this.

> +
> +                     if (index > 0) {
> +                             if (__test_and_set_bit(index, handled_map))
> +                                     continue;
> +                             else
> +                                     nr_handled++;

Think you really mean to set bits in the bitmap within the
iterate_devices callout.  So it should be renamed accordingly.

Why not count the first time a device is handled in nr_handled?
Also, it strikes me as strange that you're break'ing out this loop
early based on nr_handled... not seeing the point (and also it seems
broken, because it implies you aren't issuing flushes to targets at
the end).

> +                     }
> +             }
> +
>               atomic_add(ti->num_flush_bios, &ci->io->io_count);
>               bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios,
>                                            NULL, GFP_NOWAIT);
>               atomic_sub(ti->num_flush_bios - bios, &ci->io->io_count);
>       }
>  
> +     if (check)
> +             bitmap_free(handled_map);
> +
>       /*
>        * alloc_io() takes one extra reference for submission, so the
>        * reference won't reach 0 without the following subtraction
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 82b2195efaca..4a54b4f0a609 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -169,6 +169,12 @@ struct dm_dev {
>       struct dax_device *dax_dev;
>       blk_mode_t mode;
>       char name[16];
> +
> +     /*
> +      * sequential number for each dm_dev in dm_table's devices list,
> +      * start from 1
> +      */
> +     unsigned short index;

Please update this comment to indicate that index=0 is (ab)used as a
flag in __send_empty_flush().

Reply via email to