Re: [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless
>			clear_bit(offset, unit->map);
> +		if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
> +			clear_bit(offset, unit->map);
> +			rcu_read_unlock();
>			continue;
>		}
> +		rcu_read_unlock();
>
>		/* Call non-slab shrinkers even though kmem is disabled */
>		if (!memcg_kmem_online() &&
> @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>				set_shrinker_bit(memcg, nid, shrinker_id);
>			}
>			freed += ret;
> -
> -		if (rwsem_is_contended(&shrinker_rwsem)) {
> -			freed = freed ? : 1;
> -			goto unlock;
> -		}
> +		shrinker_put(shrinker);

Ok, so why is this safe to call without holding the rcu read lock? The global shrinker has to hold the rcu_read_lock() whilst calling shrinker_put() to guarantee the validity of the list next pointer, but we don't hold off RCU here, so what guarantees that a racing global shrinker walk doesn't trip over this shrinker_put() call dropping the refcount to zero, with the freeing occurring in a different context...

> +	/*
> +	 * We have already exited the read-side of rcu critical section
> +	 * before calling do_shrink_slab(), the shrinker_info may be
> +	 * released in expand_one_shrinker_info(), so reacquire the
> +	 * shrinker_info.
> +	 */
> +	index++;
> +	goto again;

With that, what makes the use of shrinker_info in xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid?

-Dave.

-- 
Dave Chinner
da...@fromorbit.com
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>
>  #include
>  #include
> +#include
> +#include
>
>  #define SHRINKER_UNIT_BITS BITS_PER_LONG
>
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
>
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

Documentation, please. What does the refcount protect, what does the completion provide, etc.

> +
>  	void *private_data;
>
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
>
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This needs to be moved to the head of the function, and document the whole list walk, get, put and completion parts of the algorithm that make it safe. There's more to this than "we hold a reference count", especially the tricky "we might see the shrinker before it is fully initialised" case.

>  void shrinker_free(struct shrinker *shrinker)
>  {
>  	struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>  	if (!shrinker)
>  		return;
>
> +	if (shrinker->flags & SHRINKER_REGISTERED) {
> +		shrinker_put(shrinker);
> +		wait_for_completion(&shrinker->done);
> +	}

Needs a comment explaining why we need to wait here...

> +
>  	down_write(&shrinker_rwsem);
>  	if (shrinker->flags & SHRINKER_REGISTERED) {
> -		list_del(&shrinker->list);
> +		/*
> +		 * Lookups on the shrinker are over and
Re: [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}
>				atomic_long_add(nr, &parent_unit->nr_deferred[offset]);
> +		}
>		}
>	}
>	up_read(&shrinker_rwsem);
> @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  {
>  	struct shrinker_info *info;
>  	unsigned long ret, freed = 0;
> -	int i;
> +	int offset, index = 0;
>
>  	if (!mem_cgroup_online(memcg))
>  		return 0;
> @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (unlikely(!info))
>  		goto unlock;
>
> -	for_each_set_bit(i, info->map, info->map_nr_max) {
> -		struct shrink_control sc = {
> -			.gfp_mask = gfp_mask,
> -			.nid = nid,
> -			.memcg = memcg,
> -		};
> -		struct shrinker *shrinker;
> +	for (; index < shriner_id_to_index(info->map_nr_max); index++) {
> +		struct shrinker_info_unit *unit;

This adds another layer of indent to shrink_slab_memcg(). Please factor it first so that the code ends up being readable. Doing that first as a separate patch will also make the actual algorithm changes in this patch be much more obvious - this huge hunk of diff is pretty much impossible to review...

-Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.
>
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
>
> ---
>  include/linux/shrinker.h | 17 ++
>  mm/shrinker.c            | 70 +---
>  2 files changed, 68 insertions(+), 19 deletions(-)

There's no documentation in the code explaining how the lockless shrinker algorithm works. It's left to the reader to work out how this all goes together.

> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>
>  #include
>  #include
> +#include
> +#include
>
>  #define SHRINKER_UNIT_BITS BITS_PER_LONG
>
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
>
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

What does the refcount protect, why do we need the completion, etc?

> +
>  	void *private_data;
>
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
>
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This comment really should be at the head of the function, describing the algorithm used within the function itself. i.e. how reference counts are used w.r.t. the rcu_read_lock() usage to guarantee existence of the shrinker and the validity of the list walk.

I'm not go
Re: [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote:
> On 7/27/23 17:55, Qi Zheng wrote:
> >>> 		goto err;
> >>> 	}
> >>> +	zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count;
> >>> +	zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan;
> >>> +	zmd->mblk_shrinker->seeks = DEFAULT_SEEKS;
> >>> +	zmd->mblk_shrinker->private_data = zmd;
> >>> +
> >>> +	shrinker_register(zmd->mblk_shrinker);
> >>
> >> I fail to see how this new shrinker API is better... Why isn't there a
> >> shrinker_alloc_and_register() function ? That would avoid adding all this
> >> code all over the place as the new API call would be very similar to the
> >> current shrinker_register() call with static allocation.
> >
> > In some registration scenarios, memory needs to be allocated in advance.
> > So we continue to use the previous prealloc/register_prepared()
> > algorithm. The shrinker_alloc_and_register() is just a helper function
> > that combines the two, and this increases the number of APIs that
> > shrinker exposes to the outside, so I choose not to add this helper.
>
> And that results in more code in many places instead of less code + a simple
> inline helper in the shrinker header file...

It's not just a "simple helper" - it's a function that has to take 6 or 7 parameters with a return value that must be checked and handled. This was done in the first versions of the patch set - the amount of code in each caller does not go down and, IMO, was much harder to read and determine "this is obviously correct" than what we have now.

> So not adding that super simple
> helper is not exactly the best choice in my opinion.

Each to their own - I much prefer the existing style/API over having to go look up a helper function every time I want to check some random shrinker has been set up correctly.

-Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote:
> On 2023/7/26 16:08, Dave Chinner wrote:
> > On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> > > @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker *shrinker);
> > >  void shrinker_register(struct shrinker *shrinker);
> > >  void shrinker_unregister(struct shrinker *shrinker);
> > > +static inline bool shrinker_try_get(struct shrinker *shrinker)
> > > +{
> > > +	return READ_ONCE(shrinker->registered) &&
> > > +	       refcount_inc_not_zero(&shrinker->refcount);
> > > +}
> >
> > Why do we care about shrinker->registered here? If we don't set
> > the refcount to 1 until we have fully initialised everything, then
> > the shrinker code can key entirely off the reference count and
> > none of the lookup code needs to care about whether the shrinker is
> > registered or not.
>
> The purpose of checking shrinker->registered here is to stop running
> shrinker after calling shrinker_free(), which can prevent the following
> situations from happening:
>
> CPU 0                    CPU 1
>
> shrinker_try_get()
>
>                          shrinker_try_get()
>
> shrinker_put()
> shrinker_try_get()
>                          shrinker_put()

I don't see any race here? What is wrong with having multiple active users at once?

> > This should use a completion, then it is always safe under
> > rcu_read_lock(). This also gets rid of the shrinker_lock spin lock,
> > which only exists because we can't take a blocking lock under
> > rcu_read_lock(). i.e:
> >
> > void shrinker_put(struct shrinker *shrinker)
> > {
> > 	if (refcount_dec_and_test(&shrinker->refcount))
> > 		complete(&shrinker->done);
> > }
> >
> > void shrinker_free()
> > {
> > 	.....
> > 	refcount_dec(&shrinker->refcount);
>
> I guess what you mean is shrinker_put(), because here may be the last
> refcount.

Yes, I did.

> > 	wait_for_completion(&shrinker->done);
> > 	/*
> > 	 * lookups on the shrinker will now all fail as refcount has
> > 	 * fallen to zero. We can now remove it from the lists and
> > 	 * free it.
> > 	 */
> > 	down_write(shrinker_rwsem);
> > 	list_del_rcu(&shrinker->list);
> > 	up_write(&shrinker_rwsem);
> > 	call_rcu(shrinker->rcu_head, shrinker_free_rcu_cb);
> > }
> >
> > > @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered);
> > >  void shrinker_register(struct shrinker *shrinker)
> > >  {
> > > -	down_write(&shrinker_rwsem);
> > > -	list_add_tail(&shrinker->list, &shrinker_list);
> > > -	shrinker->flags |= SHRINKER_REGISTERED;
> > > +	refcount_set(&shrinker->refcount, 1);
> > > +
> > > +	spin_lock(&shrinker_lock);
> > > +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > > +	spin_unlock(&shrinker_lock);
> > > +
> > > 	shrinker_debugfs_add(shrinker);
> > > -	up_write(&shrinker_rwsem);
> > > +	WRITE_ONCE(shrinker->registered, true);
> > >  }
> > >  EXPORT_SYMBOL(shrinker_register);
> >
> > This just looks wrong - you are trying to use WRITE_ONCE() as a
> > release barrier to indicate that the shrinker is now set up fully.
> > That's not necessary - the refcount is an atomic and along with the
> > rcu locks they should provide all the barriers we need. i.e.
>
> The reason I used WRITE_ONCE() here is because the shrinker->registered
> will be read and written concurrently (read in shrinker_try_get() and
> written in shrinker_free()), which is why I added shrinker::registered
> field instead of using SHRINKER_REGISTERED flag (this can reduce the
> addition of WRITE_ONCE()/READ_ONCE()).

Using WRITE_ONCE/READ_ONCE doesn't provide the memory barriers needed to use the field like this. You need release/acquire memory ordering here, i.e. smp_store_release()/smp_load_acquire().

As it is, the refcount_inc_not_zero() provides a control dependency, as documented in include/linux/refcount.h, and refcount_dec_and_test() provides release memory ordering. The only thing I think we may need is a write barrier before refcount_set(), such that if refcount_inc_not_zero() sees a non-zero value, it is guaranteed to see an initialised structure... i.e. refcounts provide all the existence and initialisation guarantees. Hence I don't see the need to use shr
Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.
>
> 1) When the memory pressure is high and there are many filesystems
>    mounted or unmounted at the same time, slab shrink will be affected
>    (down_read_trylock() failed).
>
>    Such as the real workload mentioned by Kirill Tkhai:
>
>    ```
>    One of the real workloads from my experience is start
>    of an overcommitted node containing many starting
>    containers after node crash (or many resuming containers
>    after reboot for kernel update). In these cases memory
>    pressure is huge, and the node goes round in long reclaim.
>    ```
>
> 2) If a shrinker is blocked (such as the case mentioned
>    in [1]) and a writer comes in (such as mount a fs),
>    then this writer will be blocked and cause all
>    subsequent shrinker-related operations to be blocked.
>
> Even if there is no competitor when shrinking slab, there may still be a
> problem. The down_read_trylock() may become a perf hotspot with frequent
> calls to shrink_slab(). Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop in IPC
> (instructions per cycle).
>
> We used to implement the lockless slab shrink with SRCU [2], but then
> kernel test robot reported -88.8% regression in
> stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
>
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
>
> For now, all shrinker instances are converted to dynamically allocated and
> will be freed by kfree_rcu(). So we can use rcu_read_{lock,unlock}() to
> ensure that the shrinker instance is valid.
>
> And the shrinker instance will not be run again after unregistration. So
> the structure that records the pointer of shrinker instance can be safely
> freed without waiting for the RCU read-side critical section.
>
> In this way, while we implement the lockless slab shrink, we don't need to
> be blocked in unregister_shrinker().
>
> The following are the test results:
>
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>
> 1) Before applying this patchset:
>
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor  bogo ops  real time  usr time  sys time  bogo ops/s    bogo ops/s
>                     (secs)     (secs)    (secs)    (real time)   (usr+sys time)
> ramfs     735238    60.00      12.37     363.70    12253.05      1955.08
> for a 60.01s run time:
>    1440.27s available CPU time
>      12.36s user time   (  0.86%)
>     363.70s system time ( 25.25%)
>     376.06s total time  ( 26.11%)
> load average: 10.79 4.47 1.69
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
>
> 2) After applying this patchset:
>
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor  bogo ops  real time  usr time  sys time  bogo ops/s    bogo ops/s
>                     (secs)     (secs)    (secs)    (real time)   (usr+sys time)
> ramfs     746677    60.00      12.22     367.75    12443.70      1965.13
> for a 60.01s run time:
>    1440.26s available CPU time
>      12.21s user time   (  0.85%)
>     367.75s system time ( 25.53%)
>     379.96s total time  ( 26.38%)
> load average: 8.37 2.48 0.86
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
>
> We can see that the ops/s has hardly changed.
>
> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomi...@virtuozzo.com/
> [2]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
> [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
> [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
> [5]. https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/
>
> Signed-off-by: Qi Zheng
> ---
>  include/linux/shrinker.h | 19 +++---
>  mm/shrinker.c            | 75 ++--
>  mm/shrinker_debug.c      | 52 +---
>  3 files changed, 104 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 36977a70bebb..335da93cccee 100644
> ---
Re: [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote:
> Currently, the shrinker instances can be divided into the following three
> types:
>
> a) global shrinker instance statically defined in the kernel, such as
>    workingset_shadow_shrinker.
>
> b) global shrinker instance statically defined in the kernel modules, such
>    as mmu_shrinker in x86.
>
> c) shrinker instance embedded in other structures.
>
> For case a, the memory of shrinker instance is never freed. For case b,
> the memory of shrinker instance will be freed after synchronize_rcu() when
> the module is unloaded. For case c, the memory of shrinker instance will
> be freed along with the structure it is embedded in.
>
> In preparation for implementing lockless slab shrink, we need to
> dynamically allocate those shrinker instances in case c, then the memory
> can be dynamically freed alone by calling kfree_rcu().
>
> So this commit adds the following new APIs for dynamically allocating
> shrinker, and add a private_data field to struct shrinker to record and
> get the original embedded structure.
>
> 1. shrinker_alloc()
>
> Used to allocate shrinker instance itself and related memory, it will
> return a pointer to the shrinker instance on success and NULL on failure.
>
> 2. shrinker_free_non_registered()
>
> Used to destroy the non-registered shrinker instance.

This is a bit nasty.

> 3. shrinker_register()
>
> Used to register the shrinker instance, which is same as the current
> register_shrinker_prepared().
>
> 4. shrinker_unregister()

Rename this "shrinker_free()" and key the two different freeing cases on the SHRINKER_REGISTERED bit rather than mostly duplicating the two:

void shrinker_free(struct shrinker *shrinker)
{
	struct dentry *debugfs_entry = NULL;
	int debugfs_id;

	if (!shrinker)
		return;

	down_write(&shrinker_rwsem);
	if (shrinker->flags & SHRINKER_REGISTERED) {
		list_del(&shrinker->list);
		debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
	} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
		kfree_const(shrinker->name);
	}

	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		unregister_memcg_shrinker(shrinker);
	up_write(&shrinker_rwsem);

	if (debugfs_entry)
		shrinker_debugfs_remove(debugfs_entry, debugfs_id);

	kfree(shrinker->nr_deferred);

	kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);

-Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
> On 2023/6/23 14:29, Dave Chinner wrote:
> > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > > On 6/22/23 10:53, Qi Zheng wrote:
> > Yes, I suggested the IDR route because radix tree lookups under RCU
> > with reference counted objects are a known safe pattern that we can
> > easily confirm is correct or not. Hence I suggested the unification
> > + IDR route because it makes the life of reviewers so, so much
> > easier...
>
> In fact, I originally planned to try the unification + IDR method you
> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
> the struct mem_cgroup is not even defined, and root_mem_cgroup and
> shrinker_info will not be allocated. This required more code changes, so
> I ended up keeping the shrinker_list and implementing the above pattern.

Yes. Go back and read what I originally said needed to be done first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs to exist that holds all of the global shrinkers. Then shrink_slab() is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be made to work. And even if memcg's are not enabled, we can still use the memcg structures to ensure a common abstraction is used for the shrinker tracking infrastructure.

> If the above pattern is not safe, I will go back to the unification +
> IDR method.

And that is exactly how we got into this mess in the first place...

-Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
> > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> >
> > -	if (!down_read_trylock(&shrinker_rwsem))
> > -		goto out;
> > -
> > -	list_for_each_entry(shrinker, &shrinker_list, list) {
> > +	rcu_read_lock();
> > +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> > 		struct shrink_control sc = {
> > 			.gfp_mask = gfp_mask,
> > 			.nid = nid,
> > 			.memcg = memcg,
> > 		};
> >
> > +		if (!shrinker_try_get(shrinker))
> > +			continue;
> > +		rcu_read_unlock();
>
> I don't think you can do this unlock?
>
> > +
> > 		ret = do_shrink_slab(&sc, shrinker, priority);
> > 		if (ret == SHRINK_EMPTY)
> > 			ret = 0;
> > 		freed += ret;
> > -		/*
> > -		 * Bail out if someone want to register a new shrinker to
> > -		 * prevent the registration from being stalled for long periods
> > -		 * by parallel ongoing shrinking.
> > -		 */
> > -		if (rwsem_is_contended(&shrinker_rwsem)) {
> > -			freed = freed ? : 1;
> > -			break;
> > -		}
> > -	}
> >
> > -	up_read(&shrinker_rwsem);
> > -out:
> > +		rcu_read_lock();
>
> That new rcu_read_lock() won't help AFAIK, the whole
> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> safe.

Yeah, that's the pattern we've been taught and the one we can look at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I think it *might* be safe. *However.* Right now I don't have time to go through a novel RCU list iteration pattern one step at a time to determine the correctness of the algorithm. I'm mostly worried about list manipulations that can occur outside rcu_read_lock() sections bleeding into the RCU critical section, because rcu_read_lock() by itself is not a memory barrier.

Maybe Paul has seen this pattern often enough he could simply tell us what conditions it is safe in. But for me to work that out from first principles? I just don't have the time to do that right now.

> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.

Yes, I suggested the IDR route because radix tree lookups under RCU with reference counted objects are a known safe pattern that we can easily confirm is correct or not. Hence I suggested the unification + IDR route because it makes the life of reviewers so, so much easier...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker
On Thu, Jun 22, 2023 at 04:53:08PM +0800, Qi Zheng wrote: > Introduce some helpers for dynamically allocating shrinker instance, > and their uses are as follows: > > 1. shrinker_alloc_and_init() > > Used to allocate and initialize a shrinker instance, the priv_data > parameter is used to pass the pointer of the previously embedded > structure of the shrinker instance. > > 2. shrinker_free() > > Used to free the shrinker instance when the registration of shrinker > fails. > > 3. unregister_and_free_shrinker() > > Used to unregister and free the shrinker instance, and the kfree() > will be changed to kfree_rcu() later. > > Signed-off-by: Qi Zheng > --- > include/linux/shrinker.h | 12 > mm/vmscan.c | 35 +++ > 2 files changed, 47 insertions(+) > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 43e6fcabbf51..8e9ba6fa3fcc 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -107,6 +107,18 @@ extern void unregister_shrinker(struct shrinker > *shrinker); > extern void free_prealloced_shrinker(struct shrinker *shrinker); > extern void synchronize_shrinkers(void); > > +typedef unsigned long (*count_objects_cb)(struct shrinker *s, > + struct shrink_control *sc); > +typedef unsigned long (*scan_objects_cb)(struct shrinker *s, > + struct shrink_control *sc); > + > +struct shrinker *shrinker_alloc_and_init(count_objects_cb count, > + scan_objects_cb scan, long batch, > + int seeks, unsigned flags, > + void *priv_data); > +void shrinker_free(struct shrinker *shrinker); > +void unregister_and_free_shrinker(struct shrinker *shrinker); H. Not exactly how I envisioned this to be done. Ok, this will definitely work, but I don't think it is an improvement. It's certainly not what I was thinking of when I suggested dynamically allocating shrinkers. 
The main issue is that this doesn't simplify the API - it expands it and creates a minefield of old and new functions that have to be used in exactly the right order for the right things to happen. What I was thinking of was moving the entire shrinker setup code over to the prealloc/register_prepared() algorithm, where the setup is already separated from the activation of the shrinker. That is, we start by renaming prealloc_shrinker() to shrinker_alloc(), adding a flags field to tell it everything that it needs to alloc (i.e. the NUMA/MEMCG_AWARE flags) and having it returned a fully allocated shrinker ready to register. Initially this also contains an internal flag to say the shrinker was allocated so that unregister_shrinker() knows to free it. The caller then fills out the shrinker functions, seeks, etc. just like the do now, and then calls register_shrinker_prepared() to make the shrinker active when it wants to turn it on. When it is time to tear down the shrinker, no API needs to change. unregister_shrinker() does all the shutdown and frees all the internal memory like it does now. If the shrinker is also marked as allocated, it frees the shrinker via RCU, too. Once everything is converted to this API, we then remove register_shrinker(), rename register_shrinker_prepared() to shrinker_register(), rename unregister_shrinker to shrinker_unregister(), get rid of the internal "allocated" flag and always free the shrinker. At the end of the patchset, every shrinker should be set up in a manner like this: sb->shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE|SHRINKER_NUMA_AWARE, "sb-%s", type->name); if (!sb->shrinker) return -ENOMEM; sb->shrinker->count_objects = super_cache_count; sb->shrinker->scan_objects = super_cache_scan; sb->shrinker->batch = 1024; sb->shrinker->private = sb; . shrinker_register(sb->shrinker); And teardown is just a call to shrinker_unregister(sb->shrinker) as it is now. i.e. 
the entire shrinker registration API is now just three functions, down from the current four, and much simpler than the seven functions this patch set results in...

The other advantage of this is that it will break all the existing out of tree code and third party modules using the old API and will no longer work with a kernel using lockless slab shrinkers. They need to break (both at the source and binary levels) to stop bad things from happening due to using unconverted shrinkers in the new setup.

-Dave.

-- Dave Chinner da...@fromorbit.com
Re: [PATCH v6 6/6] xfs: disable map_sync for async flush
On Tue, Apr 23, 2019 at 01:36:12PM +0530, Pankaj Gupta wrote: > Dont support 'MAP_SYNC' with non-DAX files and DAX files > with asynchronous dax_device. Virtio pmem provides > asynchronous host page cache flush mechanism. We don't > support 'MAP_SYNC' with virtio pmem and xfs. > > Signed-off-by: Pankaj Gupta > --- > fs/xfs/xfs_file.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 1f2e2845eb76..0e59be018511 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1196,11 +1196,13 @@ xfs_file_mmap( > struct file *filp, > struct vm_area_struct *vma) > { > - /* > - * We don't support synchronous mappings for non-DAX files. At least > - * until someone comes with a sensible use case. > + struct dax_device *dax_dev = xfs_find_daxdev_for_inode > + (file_inode(filp)); > + > + /* We don't support synchronous mappings for non-DAX files and > + * for DAX files if underneath dax_device is not synchronous. >*/

	/*
	 * This is the correct multi-line comment format. Please
	 * update the patch to maintain the existing comment format.
	 */

Cheers, Dave.

-- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 5/5] xfs: disable map_sync for async flush
On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote: > Virtio pmem provides asynchronous host page cache flush > mechanism. we don't support 'MAP_SYNC' with virtio pmem > and xfs. > > Signed-off-by: Pankaj Gupta > --- > fs/xfs/xfs_file.c | 8 > 1 file changed, 8 insertions(+) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 1f2e2845eb76..dced2eb8c91a 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1203,6 +1203,14 @@ xfs_file_mmap( > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC)) > return -EOPNOTSUPP; > > + /* We don't support synchronous mappings with DAX files if > + * dax_device is not synchronous. > + */ > + if (IS_DAX(file_inode(filp)) && !dax_synchronous( > + xfs_find_daxdev_for_inode(file_inode(filp))) && > + (vma->vm_flags & VM_SYNC)) > + return -EOPNOTSUPP; > + > file_accessed(filp); > vma->vm_ops = &xfs_file_vm_ops; > if (IS_DAX(file_inode(filp)))

All this ad hoc IS_DAX conditional logic is getting pretty nasty.

	xfs_file_mmap(
	{
		struct inode	*inode = file_inode(filp);

		if (vma->vm_flags & VM_SYNC) {
			if (!IS_DAX(inode))
				return -EOPNOTSUPP;
			if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
				return -EOPNOTSUPP;
		}

		file_accessed(filp);
		vma->vm_ops = &xfs_file_vm_ops;
		if (IS_DAX(inode))
			vma->vm_flags |= VM_HUGEPAGE;
		return 0;
	}

Even better, factor out all the "MAP_SYNC supported" checks into a helper so that the filesystem code just doesn't have to care about the details of checking for DAX+MAP_SYNC support

Cheers, Dave.

-- Dave Chinner da...@fromorbit.com
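[Editor's note: mainline later added a helper of essentially this shape, daxdev_mapping_supported(), in the DAX headers. The sketch below is reconstructed from memory rather than copied verbatim from kernel source, so treat the details as approximate.]

	/* Sketch: centralised "is MAP_SYNC supported here?" policy check. */
	static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
						    struct dax_device *dax_dev)
	{
		if (!(vma->vm_flags & VM_SYNC))
			return true;
		if (!IS_DAX(file_inode(vma->vm_file)))
			return false;
		return dax_synchronous(dax_dev);
	}

With that helper, the filesystem's mmap path collapses to a single "if (!daxdev_mapping_supported(vma, dax_dev)) return -EOPNOTSUPP;" check.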
Re: [Qemu-devel] security implications of caching with virtio pmem (was Re: [PATCH v3 0/5] kvm "virtio pmem" device)
On Mon, Feb 11, 2019 at 02:29:46AM -0500, Pankaj Gupta wrote: > Hello Dave, > Are we okay with this? Sure. I'm not sure I agree with all the analysis presented, but, well, I haven't looked any deeper because I'm tired of being shouted at and being called argumentative for daring to ask hard questions about this topic Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/5] kvm "virtio pmem" device
On Mon, Jan 14, 2019 at 01:35:57PM -0800, Dan Williams wrote: > On Mon, Jan 14, 2019 at 1:25 PM Dave Chinner wrote: > > > > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote: > > > > > > > > Until you have images (and hence host page cache) shared between > > > > > multiple guests. People will want to do this, because it means they > > > > > only need a single set of pages in host memory for executable > > > > > binaries rather than a set of pages per guest. Then you have > > > > > multiple guests being able to detect residency of the same set of > > > > > pages. If the guests can then, in any way, control eviction of the > > > > > pages from the host cache, then we have a guest-to-guest information > > > > > leak channel. > > > > > > > > I don't think we should ever be considering something that would allow a > > > > guest to evict page's from the host's pagecache [1]. The guest should > > > > be able to kick its own references to the host's pagecache out of its > > > > own pagecache, but not be able to influence whether the host or another > > > > guest has a read-only mapping cached. > > > > > > > > [1] Unless the guest is allowed to modify the host's file; obviously > > > > truncation, holepunching, etc are going to evict pages from the host's > > > > page cache. > > > > > > This is so correct. Guest does not not evict host page cache pages > > > directly. > > > > They don't right now. > > > > But someone is going to end up asking for discard to work so that > > the guest can free unused space in the underlying spares image (i.e. > > make use of fstrim or mount -o discard) because they have workloads > > that have bursts of space usage and they need to trim the image > > files afterwards to keep their overall space usage under control. > > > > And then > > ...we reject / push back on that patch citing the above concern. So at what point do we draw the line? 
We're allowing writable DAX mappings, but as I've pointed out that means we are going to be allowing a potential information leak via files with shared extents to be directly mapped and written to. But we won't allow useful admin operations that allow better management of host side storage space similar to how normal image files are used by guests because it's an information leak vector? That's splitting some really fine hairs there... > > > In case of virtio-pmem & DAX, guest clears guest page cache exceptional > > > entries. > > > Its solely decision of host to take action on the host page cache pages. > > > > > > In case of virtio-pmem, guest does not modify host file directly i.e don't > > > perform hole punch & truncation operation directly on host file. > > > > ... this will no longer be true, and the nuclear landmine in this > > driver interface will have been armed > > I agree with the need to be careful when / if explicit cache control > is added, but that's not the case today. "if"? I expect it to be "when", not if. Expect the worst, plan for it now. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/5] kvm "virtio pmem" device
On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote: > > > > Until you have images (and hence host page cache) shared between > > > multiple guests. People will want to do this, because it means they > > > only need a single set of pages in host memory for executable > > > binaries rather than a set of pages per guest. Then you have > > > multiple guests being able to detect residency of the same set of > > > pages. If the guests can then, in any way, control eviction of the > > > pages from the host cache, then we have a guest-to-guest information > > > leak channel. > > > > I don't think we should ever be considering something that would allow a > > guest to evict page's from the host's pagecache [1]. The guest should > > be able to kick its own references to the host's pagecache out of its > > own pagecache, but not be able to influence whether the host or another > > guest has a read-only mapping cached. > > > > [1] Unless the guest is allowed to modify the host's file; obviously > > truncation, holepunching, etc are going to evict pages from the host's > > page cache. > > This is so correct. Guest does not not evict host page cache pages directly.

They don't right now.

But someone is going to end up asking for discard to work so that the guest can free unused space in the underlying sparse image (i.e. make use of fstrim or mount -o discard) because they have workloads that have bursts of space usage and they need to trim the image files afterwards to keep their overall space usage under control. And then

> In case of virtio-pmem & DAX, guest clears guest page cache exceptional > entries. > Its solely decision of host to take action on the host page cache pages. > > In case of virtio-pmem, guest does not modify host file directly i.e don't > perform hole punch & truncation operation directly on host file.

... this will no longer be true, and the nuclear landmine in this driver interface will have been armed

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 0/5] kvm "virtio pmem" device
On Fri, Jan 11, 2019 at 02:45:04AM -0500, Pankaj Gupta wrote: > > > > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote: > > > This patch series has implementation for "virtio pmem". > > > "virtio pmem" is fake persistent memory(nvdimm) in guest > > > which allows to bypass the guest page cache. This also > > > implements a VIRTIO based asynchronous flush mechanism. > > > > H. Sharing the host page cache direct into the guest VM. Sounds > > like a good idea, but. > > > > This means the guest VM can now run timing attacks to observe host > > side page cache residency, and depending on the implementation I'm > > guessing that the guest will be able to control host side page > > cache eviction, too (e.g. via discard or hole punch operations). > > Not sure how? this is similar to mmapping virtual memory by any userspace > process. Any host userspace process can do such attack on host page cache > using mincore & mmap shared file. Mincore is for monitoring, not cached eviction. And it's not required to observe cache residency, either. That's a wide open field containing an uncountable number of moles... > But i don't think guest can do this alone. For virtio-pmem usecase > guest won't be using page cache so timing attack from only guest > side is not possible unless host userspace can run checks on page > cache eviction state using mincore etc. As rightly described by > Rik, guest will only access its own page cache pages and if guest > page cache is managed directly by host, this saves alot of effort > for guest in transferring guest state of page cache. Until you have images (and hence host page cache) shared between multiple guests. People will want to do this, because it means they only need a single set of pages in host memory for executable binaries rather than a set of pages per guest. Then you have multiple guests being able to detect residency of the same set of pages. 
If the guests can then, in any way, control eviction of the pages from the host cache, then we have a guest-to-guest information leak channel. i.e. it's something we need to be aware of and really careful about enabling infrastructure that /will/ be abused if guests can find a way to influence the host side cache residency. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/5] kvm "virtio pmem" device
On Sun, Jan 13, 2019 at 03:38:21PM -0800, Matthew Wilcox wrote: > On Mon, Jan 14, 2019 at 10:29:02AM +1100, Dave Chinner wrote: > > Until you have images (and hence host page cache) shared between > > multiple guests. People will want to do this, because it means they > > only need a single set of pages in host memory for executable > > binaries rather than a set of pages per guest. Then you have > > multiple guests being able to detect residency of the same set of > > pages. If the guests can then, in any way, control eviction of the > > pages from the host cache, then we have a guest-to-guest information > > leak channel. > > I don't think we should ever be considering something that would allow a > guest to evict page's from the host's pagecache [1]. The guest should > be able to kick its own references to the host's pagecache out of its > own pagecache, but not be able to influence whether the host or another > guest has a read-only mapping cached. > > [1] Unless the guest is allowed to modify the host's file; obviously > truncation, holepunching, etc are going to evict pages from the host's > page cache. Right, and that's exactly what I mean by "we need to be real careful with functionality like this". To be honest, I really don't think I've even touched the surface here. e.g. Filesystems and storage can share logical and physical extents. Which means that image files that share storage (e.g. because they are all cloned from the same master image and/or there's in-line deduplication running on the storage) and can be directly accessed by guests may very well be susceptible to detection of host side deduplication and subsequent copy-on-write operations. This really doesn't seem much different to me from the guest being able to infer host side KSM page deduplication and COW operation in the guest side page cache. The only difference is that DAX is being used to probe the host side page cache and storage rather than the guest side. 
IOWs, I suspect there's a world of pain waiting for us if we punch huge holes through the virtual machine abstractions like this. Improving performance is a laudable goal, but at what price?

Cheers, Dave.

-- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 0/5] kvm "virtio pmem" device
On Tue, Jan 15, 2019 at 12:35:06AM -0500, Pankaj Gupta wrote: > > > > > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote: > > > > > > > > > > > > Until you have images (and hence host page cache) shared between > > > > > > > multiple guests. People will want to do this, because it means > > > > > > > they > > > > > > > only need a single set of pages in host memory for executable > > > > > > > binaries rather than a set of pages per guest. Then you have > > > > > > > multiple guests being able to detect residency of the same set of > > > > > > > pages. If the guests can then, in any way, control eviction of the > > > > > > > pages from the host cache, then we have a guest-to-guest > > > > > > > information > > > > > > > leak channel. > > > > > > > > > > > > I don't think we should ever be considering something that would > > > > > > allow a > > > > > > guest to evict page's from the host's pagecache [1]. The guest > > > > > > should > > > > > > be able to kick its own references to the host's pagecache out of > > > > > > its > > > > > > own pagecache, but not be able to influence whether the host or > > > > > > another > > > > > > guest has a read-only mapping cached. > > > > > > > > > > > > [1] Unless the guest is allowed to modify the host's file; obviously > > > > > > truncation, holepunching, etc are going to evict pages from the > > > > > > host's > > > > > > page cache. > > > > > > > > > > This is so correct. Guest does not not evict host page cache pages > > > > > directly. > > > > > > > > They don't right now. > > > > > > > > But someone is going to end up asking for discard to work so that > > > > the guest can free unused space in the underlying spares image (i.e. > > > > make use of fstrim or mount -o discard) because they have workloads > > > > that have bursts of space usage and they need to trim the image > > > > files afterwards to keep their overall space usage under control. 
> > > > > > > > And then > > > > > > ...we reject / push back on that patch citing the above concern. > > > > So at what point do we draw the line? > > > > We're allowing writable DAX mappings, but as I've pointed out that > > means we are going to be allowing a potential information leak via > > files with shared extents to be directly mapped and written to. > > > > But we won't allow useful admin operations that allow better > > management of host side storage space similar to how normal image > > files are used by guests because it's an information leak vector? > > First of all Thank you for all the useful discussions. > I am summarizing here: > > - We have to live with the limitation to not support fstrim and > mount -o discard options with virtio-pmem as they will evict > host page cache pages. We cannot allow this for virtio-pmem > for security reasons. These filesystem commands will just zero out > unused pages currently.

Not sure I follow you here - what pages are going to be zeroed and when will they be zeroed? If discard is not allowed, filesystems just don't issue such commands and the underlying device will never see them.

> - If alot of space is unused and not freed guest can request host > Administrator for truncating the host backing image.

You can't use truncate to free space in a disk image file. The only way to do it safely in a generic, filesystem agnostic way is to mount the disk image (e.g. on loopback) and run fstrim on it. The loopback device will punch holes in the file where all the free space is reported by the filesystem via discard requests. Which is kinda my point - this could only be done if the guest is shut down, which makes it very difficult for admins to manage.

> We are also planning to support qcow2 sparse image format at > host side with virtio-pmem.

So you're going to be remapping a huge number of disjoint regions into a linear pmem mapping?
ISTR discussions about similar things for virtio+fuse+dax that came up against "large numbers of mapped regions don't scale" and so it wasn't a practical solution compared to just using raw sparse files

> - There is no existing solution for Qemu persistent memory > emulation with write support currently. This solution provides > us the paravartualized way of emulating persistent memory.

Sure, but the question is why do you need to create an emulation that doesn't actually perform like pmem? The whole point of pmem is performance, and emulating pmem by mmap() of a file on spinning disks is going to be horrible for performance. Even on SSDs it's going to be orders of magnitude slower than real pmem.

So exactly what problem are you trying to solve with this driver?

Cheers, Dave.

-- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 0/5] kvm "virtio pmem" device
On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote: > This patch series has implementation for "virtio pmem". > "virtio pmem" is fake persistent memory(nvdimm) in guest > which allows to bypass the guest page cache. This also > implements a VIRTIO based asynchronous flush mechanism. H. Sharing the host page cache direct into the guest VM. Sounds like a good idea, but. This means the guest VM can now run timing attacks to observe host side page cache residency, and depending on the implementation I'm guessing that the guest will be able to control host side page cache eviction, too (e.g. via discard or hole punch operations). Which means this functionality looks to me like a new vector for information leakage into and out of the guest VM via guest controlled host page cache manipulation. https://arxiv.org/pdf/1901.01161 I might be wrong, but if I'm not we're going to have to be very careful about how guest VMs can access and manipulate host side resources like the page cache. Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 2/2] block: virtio-blk: support multi virt queues per virtio-blk device
On Sun, Jun 22, 2014 at 01:24:48PM +0300, Michael S. Tsirkin wrote: > On Fri, Jun 20, 2014 at 11:29:40PM +0800, Ming Lei wrote: > > @@ -24,8 +26,8 @@ static struct workqueue_struct *virtblk_wq; > > struct virtio_blk > > { > > struct virtio_device *vdev; > > - struct virtqueue *vq; > > - spinlock_t vq_lock; > > + struct virtqueue *vq[MAX_NUM_VQ]; > > + spinlock_t vq_lock[MAX_NUM_VQ]; > > array of struct { > *vq; > spinlock_t lock; > } > would use more memory but would get us better locality. > It might even make sense to add padding to avoid > cacheline sharing between two unrelated VQs. > Want to try?

It's still false sharing because the queue objects share cachelines. To operate without contention they have to be physically separated from each other like so:

	struct vq {
		struct virtqueue	*q;
		spinlock_t		lock;
	} ____cacheline_aligned_in_smp;

	struct some_other_struct {
		struct vq		vq[MAX_NUM_VQ];
	};

This keeps locality to objects within a queue, but separates each queue onto its own cacheline

Cheers, Dave.

-- Dave Chinner da...@fromorbit.com
Re: [patch 0/4] [RFC] Another proportional weight IO controller
On Fri, Nov 07, 2008 at 11:31:44AM +0100, Peter Zijlstra wrote: > On Fri, 2008-11-07 at 11:41 +1100, Dave Chinner wrote: > > On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote: > > > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote: > > > > Peter Zijlstra wrote: > > > > > > > > > The only real issue I can see is with linear volumes, but > > > > > those are stupid anyway - non of the gains but all the > > > > > risks. > > > > > > > > Linear volumes may well be the most common ones. > > > > > > > > People start out with the filesystems at a certain size, > > > > increasing onto a second (new) disk later, when more space > > > > is required. > > > > > > Are they aware of how risky linear volumes are? I would > > > discourage anyone from using them. > > > > In what way are they risky? > > You loose all your data when one disk dies, so your mtbf decreases > with the number of disks in your linear span. And you get non of > the benefits from having multiple disks, like extra speed from > striping, or redundancy from raid. Fmeh. Step back and think for a moment. How does every major distro build redundant root drives? Yeah, they build a mirror and then put LVM on top of the mirror to partition it. Each partition is a *linear volume*, but no single disk failure is going to lose data because it's been put on top of a mirror. IOWs, reliability of linear volumes is only an issue if you don't build redundancy into your storage stack. Just like RAID0, a single disk failure will lose data. So, most people use linear volumes on top of RAID1 or RAID5 to avoid such a single disk failure problem. People do the same thing with RAID0 - it's what RAID10 and RAID50 do Also, linear volume performance scalability is on a different axis to striping. Striping improves bandwidth, but each disk in a stripe tends to make the same head movements. Hence striping improves sequential throughput but only provides limited iops scalability. 
Effectively, striping only improves throughput while the disks are not seeking a lot. Add a few parallel I/O streams, and a stripe will start to slow down as each disk seeks between streams. i.e. disks in stripes cannot be considered to be able to operate independently.

Linear volumes create independent regions within the address space - the regions can seek independently when under concurrent I/O and hence iops scalability is much greater. Aggregate bandwidth is the same as striping, it's just that a single stream is limited in throughput. If you want to improve single stream throughput, you stripe before you concatenate. That's why people create layered storage systems like this:

	linear volume
	  |->stripe
	  |	|-> md RAID5
	  |	|	|-> disk
	  |	|	|-> disk
	  |	|	|-> disk
	  |	|	|-> disk
	  |	|	|-> disk
	  |	|-> md RAID5
	  |	|	|-> disk
	  |	|	|-> disk
	  |	|	|-> disk
	  |	|	|-> disk
	  |	|	|-> disk
	  |->stripe
	  |	|-> md RAID5
	  |	..
	  |->stripe
	  ..

What you then need is a filesystem that can spread the load over such a layout. Let's use, for argument's sake, XFS and tell it the geometry of the RAID5 luns that make up the volume so that its allocation is all nicely aligned. Then we match the allocation group size to the size of each independent part of the linear volume. Now when XFS spreads its inodes and data over multiple AGs, it's spreading the load across disks that can operate concurrently

Effectively, linear volumes are about as dangerous as striping. If you don't build in redundancy at a level below the linear volume or stripe, then you lose when something fails.

Cheers, Dave.

-- Dave Chinner [EMAIL PROTECTED]
Re: [patch 0/4] [RFC] Another proportional weight IO controller
On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote: > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote: > > Peter Zijlstra wrote: > > > > > The only real issue I can see is with linear volumes, but those are > > > stupid anyway - non of the gains but all the risks. > > > > Linear volumes may well be the most common ones. > > > > People start out with the filesystems at a certain size, > > increasing onto a second (new) disk later, when more space > > is required. > > Are they aware of how risky linear volumes are? I would discourage > anyone from using them. In what way are they risky? Cheers, Dave. -- Dave Chinner [EMAIL PROTECTED] ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization